
select * from code;

Use SQL to query and scan infrastructure as code for insights and security best practices.

Steampipe Team
5 min. read - Oct 04, 2023

Infrastructure as code (IaC) has unleashed a Cambrian explosion of tools and techniques for modeling IT infrastructure and the policies governing it. From Terraform to Kubernetes to policy-as-code, we now define our desired infrastructure states declaratively. This evolution has enabled infrastructure automation, validation, and self-service. But it has also created a bewildering landscape of complex IaC formats and concepts. Terraform plans, Kubernetes manifests, and CloudFormation templates all model infrastructure differently. Under the hood, though, there is a unified data model for IaC that represents cloud infrastructure as structured data.

With the right lens you can extract a coherent graph of resources, relationships, and metadata — not only from IaC definitions, but also from live deployments. Steampipe is that lens: a single binary that plugs into your CI/CD pipelines, taps into the entire Turbot ecosystem of API connectors and security checks, and enables ad-hoc queries against both IaC resources and deployed infrastructure. Do it all without having to script multi-lingual conversations with APIs and CLIs. Like everything else about IaC, the language you use to reason over it should be declarative. Steampipe's core language, SQL, meets the need.

Query infrastructure as data

Let's look at some basic examples across the full range of the common data model, starting with this snippet of AWS-flavored Terraform.

resource "aws_instance" "turbot_bastion_host" {
  ami           = "ami-a1b2c3d4e5f6789ab"
  instance_type = "t2.micro"
  tags = {
    Name        = "Turbot Bastion Host"
    Environment = "Prod"
  }
}

resource "aws_instance" "silverwater_bastion" {
  ami           = "ami-0f9e8d7c6b5a43210z"
  instance_type = "t2.micro"
  tags = {
    Name        = "Silverwater Bastion"
    Environment = "Dev"
  }
}

This query finds two rows in answer to the question "What are the AMIs for our bastions?"

select
  arguments -> 'tags' ->> 'Name' as name,
  arguments ->> 'ami' as image_id
from
  terraform_resource
where
  arguments -> 'tags' ->> 'Name' ~ 'Bastion'

+---------------------+------------------------+
| name                | image_id               |
+---------------------+------------------------+
| Silverwater Bastion | ami-0f9e8d7c6b5a43210z |
| Turbot Bastion Host | ami-a1b2c3d4e5f6789ab  |
+---------------------+------------------------+

Once the instances are deployed, this query asks the same question of the live infrastructure. It returns the same rows, augmented with facts (instance_id, instance_state) known only at runtime.

select
  title as name,
  image_id,
  instance_id,
  instance_state
from
  aws_ec2_instance
where
  tags ->> 'Name' ~ 'Bastion'

+---------------------+------------------------+---------------------+----------------+
| name                | image_id               | instance_id         | instance_state |
+---------------------+------------------------+---------------------+----------------+
| Silverwater Bastion | ami-0f9e8d7c6b5a43210z | i-0957d10a9e54d560e | running        |
| Turbot Bastion Host | ami-a1b2c3d4e5f6789ab  | i-0b42074c10ba42ff0 | running        |
+---------------------+------------------------+---------------------+----------------+

The two queries are not exactly the same, but they share a common model that abstracts over both parsing static Terraform resources and calling APIs against live infrastructure.
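
Because both tables speak SQL, you can even join them to compare declared and deployed state. This is a sketch, assuming the Terraform and AWS plugins are both configured; it matches resources on the Name tag, which is a convention in this example, not a guaranteed key.

```sql
select
  tf.arguments -> 'tags' ->> 'Name' as name,
  tf.arguments ->> 'ami' as declared_ami,
  ec2.image_id as deployed_ami,
  ec2.instance_state
from
  terraform_resource as tf
  join aws_ec2_instance as ec2
    on ec2.tags ->> 'Name' = tf.arguments -> 'tags' ->> 'Name'
where
  tf.type = 'aws_instance'
```

A mismatch between declared_ami and deployed_ami would signal drift between the code and the cloud.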

Now let's look through a GCP-flavored Kubernetes lens. Here's part of the values.yaml file for a Helm chart.

images:
  repository: gcr.io/google-samples/microservices-demo
adService:
  create: true
  name: adservice
  resources:
    requests:
      cpu: 200m
      memory: 180Mi
    limits:
      cpu: 300m
      memory: 300Mi

This query uses the Kubernetes plugin to ask the Helm chart: "What are the resources specified for the adservice app?"

select distinct
  key_path,
  value
from
  helm_value
where
  key_path::text ~ 'adService.resources'

+-------------------------------------+-------+
| key_path                            | value |
+-------------------------------------+-------+
| adService.resources.limits.cpu      | 300m  |
| adService.resources.limits.memory   | 300Mi |
| adService.resources.requests.cpu    | 200m  |
| adService.resources.requests.memory | 180Mi |
+-------------------------------------+-------+

This query also uses the Kubernetes plugin to ask the same question of the corresponding live Kubernetes deployment.

select distinct
  namespace,
  jsonb_pretty(template -> 'spec' -> 'containers' -> 0 -> 'resources') as resources
from
  kubernetes_deployment
where
  name ~ 'adservice'

+------------------------+---------------------------+
| namespace              | resources                 |
+------------------------+---------------------------+
| gke-microservices-demo | {                         |
|                        |     "limits": {           |
|                        |         "memory": "300Mi" |
|                        |     },                    |
|                        |     "requests": {         |
|                        |         "memory": "180Mi" |
|                        |     }                     |
|                        | }                         |
+------------------------+---------------------------+

This query asks: "Which node is running the app?"

with images as (
  select
    name,
    jsonb_array_elements_text(image.value -> 'names') as image_name
  from
    kubernetes_node,
    jsonb_array_elements(images) as image
)
select
  name
from
  images
where
  image_name ~ 'microservices-demo'
limit 1

+-----------------------------------------------------+
| name                                                |
+-----------------------------------------------------+
| gke-smyth-test-cluster-1-default-pool-42aa2e29-7264 |
+-----------------------------------------------------+

And now, switching from the Kubernetes plugin to the GCP plugin, this query asks: "What are the instance details?"

select
  machine_type_name,
  status,
  zone_name,
  network_interfaces -> 0 ->> 'networkIP' as network_ip
from
  gcp_compute_instance
where
  name = 'gke-smyth-test-cluster-1-default-pool-42aa2e29-7264'

+-------------------+---------+---------------+-------------+
| machine_type_name | status  | zone_name     | network_ip  |
+-------------------+---------+---------------+-------------+
| e2-medium         | RUNNING | us-central1-c | 10.128.0.48 |
+-------------------+---------+---------------+-------------+

Check infrastructure as data

Because the common model enables queries against both statically-defined resources and deployed resources, compliance mods can check both types. For example, this Terraform code defines two AWS S3 buckets, one with logging and one without.

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "my_bucket_1" {
  bucket = "jon-test-project-foo-bucket-01"

  logging {
    target_bucket = aws_s3_bucket.my_bucket_2.bucket
    target_prefix = "log/"
  }
}

resource "aws_s3_bucket" "my_bucket_2" {
  bucket = "jon-test-project-foo-bucket-02"
}

The AWS Compliance mod includes a control (s3_bucket_logging_enabled) that checks to ensure buckets will be logged. If we deploy these buckets and run the control, we'll see that bucket 2 is in the Alarm state.

But why even let that happen? The corresponding Terraform AWS Compliance mod also includes an s3_bucket_logging_enabled control. We can use that to detect the misconfiguration and fix it before the buckets ever get deployed.
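
Under the hood, a control like this boils down to a query over the common model. Here's a simplified sketch of the idea, not the mod's actual query: it flags any aws_s3_bucket resource whose arguments lack a logging block.

```sql
select
  name,
  case
    when arguments -> 'logging' is null then 'alarm'
    else 'ok'
  end as status
from
  terraform_resource
where
  type = 'aws_s3_bucket'
```

Run against the snippet above, a query like this would report my_bucket_1 as ok and my_bucket_2 as alarm.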

Mods that check statically-defined resources have long been available for both Terraform (AWS, Azure, GCP) and Kubernetes. The queries behind all these controls exploit the power of the common model: it's just SQL, whether you're reasoning over static resources or live APIs.
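
One way to run such a control locally is to clone the mod and point steampipe check at it. A minimal sketch, assuming Steampipe and the Terraform plugin are already installed and configured to scan your .tf files:

```shell
git clone https://github.com/turbot/steampipe-mod-terraform-aws-compliance
cd steampipe-mod-terraform-aws-compliance
steampipe check control.s3_bucket_logging_enabled
```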

Now we've broadened the scope of queries against static resources. In Interrogate Terraform we showcase a new capability: you can now query plan files and state files along with .tf files. And in Interrogate Kubernetes we discuss the analogous new ability to query Helm charts and manifests along with live clusters. These new query targets make it even easier to reason over static resources and compare them to those you've deployed. It all follows logically from a unified model that abstracts the differences between static resources and live infrastructure.

GitOps for infrastructure as data

You've long been able to install Steampipe in CI/CD pipelines and use it to run checks and queries. That becomes even more powerful now that you can query and check Terraform plans and state files, as well as Kubernetes charts and manifests. But wait, there's more! We're also announcing new GitHub actions that make it easier than ever to use Steampipe, along with plugins and mods, in GitHub workflows. Use the Setup action to install Steampipe itself, along with any of the plugins you'll need. Then use the Check action to install mods that run your checks.
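
As a sketch, a workflow using those actions might look like the following. The action versions and input names shown here (plugin-connections, mod-url) are illustrative assumptions; verify them against the actions' current documentation.

```yaml
name: steampipe-checks
on: pull_request
jobs:
  terraform-compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install Steampipe and configure the Terraform plugin to scan this repo
      - uses: turbot/steampipe-action-setup@v1
        with:
          plugin-connections: |
            connection "terraform" {
              plugin = "terraform"
              paths  = ["./**/*.tf"]
            }
      # Run the Terraform AWS Compliance mod against the checked-out code
      - uses: turbot/steampipe-action-check@v1
        with:
          mod-url: https://github.com/turbot/steampipe-mod-terraform-aws-compliance
```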

The Check action has a special trick up its sleeve. If a compliance mod detects a misconfiguration in a static resource that's included in a pull request, the action will annotate the PR with the alarm reported by the control — at the line number in the resource file that triggered the alarm! Again, all this follows logically from a unified model that enables common declarative query over static resources and live APIs.

select * from cloud join select * from code

Steampipe's original mantra was select * from cloud. We thought it was amazing to be able to query within and across live APIs using basic SQL, and those of you who've experienced that tell us you agree. In an IaC world, though, resources you can see via APIs are defined in code, so we made it possible to query the code too. In three companion posts — Interrogate Terraform, Interrogate Kubernetes, and Steampipe-powered GitHub actions — we explore the latest developments enabled by an IaC data model that unifies everything.

Questions? Comments? Bring them to the Turbot community! We'd love to know your thoughts.