v0.10.0: Better concurrency and caching →
Case Study

Using SQL to check spreadsheet integrity

The CSV plugin brings spreadsheet data to Steampipe. We show how to write compliance checks for that data.

Jon Udell
7 min. read - November 3, 2021

We have a love/hate relationship with our spreadsheets. We love them because they're convenient, and we hate them because they're uncontrolled. With the 0.9 release of Steampipe we can now apply integrity checks to our spreadsheet data. The new CSV plugin is the enabler for this magic. Building on dynamic schema support in Steampipe, the CSV plugin is first of a new kind of plugin that maps data to tables without requiring predefined schemas.

Fixed vs dynamic schemas

Previously, every Steampipe plugin worked with a fixed schema. The Slack plugin, for example, consumes Slack's REST API to tables using definitions from the golang API wrapper for Slack. That wrapper defines a User with a fixed set of fields: ID, TeamID, Name, etc. The Slack plugin maps those fields to the columns you see when you select * from slack_user.

The CSV plugin doesn't need such fixed schemas: it builds them on the fly. That means you can acquire tables from any source that exports data to CSV files. You can query these separately; you can join across them; and you can join CSV-sourced tables with fixed-schema tables provided by other plugins.

Granular resource tagging

Suppose you're using Steampipe to check resource tags against a vocabulary of mandatory tags. In our 0.8 release we introduced a way to specify that list of tags when running one of the tagging control mods for AWS, Azure, or GCP. In the AWS context, for example, you can do this.

steampipe check benchmark.mandatory --var \
'mandatory_tags=["Application", "Environment", "Department", "Owner"]'

That command runs 71 controls, each of which interpolates the list of tags passed on the command line, and checks each of associated resources for those tags.

What if you want to augment the general rule with checks for specific tags associated with specific resources? You can now record a mapping from resources to resource-specific tags in a spreadsheet, export it to CSV, and write new controls that join against the mapping table defined in that CSV file.

Integrity checks for spreadsheets

Beyond augmenting existing Steampipe queries and controls, you can now write new ones to explore and validate the data in critical spreadsheets. For example, here's a common scenario: an event planner uses a spreadsheet to keep track of people and sessions. The spreadsheet has two tabs like so.

person,sessions
Mary,"101, 102"
Frank,101
Joe,"103, 104"
Sasha,101
session,time
100,11/18/2021 9:00
101,11/18/2021 9:00
102,11/18/2021 9:00
201,11/18/2021 10:00
202,11/18/2021 10:00

To work with this data in Steampipe, export the spreadsheet tabs to files: people.csv and sessions.csv. Then install the CSV plugin (steampipe install plugin csv), edit ~/.steampipe/config/csv.spc, and set paths to include to the folder containing the files (in this case, ~/csv).

connection "csv" {
plugin = "csv"
paths = [ "~/*.csv" ]
}

Now, when you run steampipe query, those files magically appear as tables in the csv namespace.

Writing custom controls to check data integrity

As is typical in such scenarios, nothing guarantees that the session numbers are valid. Let's say that the valid sessions are 101-105 and 201-205. Here is a control that checks that condition.

control "sessions_valid_in_session_table" {
title = "Sessions in the sessions table are valid"
sql = <<EOT
select
'event planner' as resource,
case
when (
session = any(array['101','102','103','104','105','201','202','203','204','205'])
) then 'ok'
else 'alarm'
end as status,
'Sessions in the sessions table are valid' as reason
from
csv.sessions
EOT
}

To "deploy" this control we'll create a directory, event_planning, and in it place a file, event_planning.sp, containing the above code. This directory is implicitly a workspace in which Steampipe can run both named queries and mods. Here is our definition of a mod:

A Steampipe mod is a portable, versioned collection of related Steampipe resources such as queries, controls, and benchmarks. Steampipe mods and mod resources are defined in HCL, and distributed as simple text files. Modules can be found on the Steampipe Hub, and may be shared with others from any public git repository.

Mods can be very elaborate. The AWS compliance mod, for example, comprises 10 benchmarks that encapsulate 433 controls and 224 named queries.


But they can also be very simple. If our event_planning.sp file contains only the code shown above, we can do this to find invalid sessions in the sessions table.

~/event_planning$ steampipe check all

Checking a more complex table

We also want to know if the sessions mentioned in the people table are valid. In order to do that, we'll need to project each list of sessions to a set of rows.

with person_session as (
select
person,
string_to_array(
regexp_replace(sessions, '\s', '', 'g'),
','
) as sessions
from csv.people
)
select
person,
unnest(sessions) as session
from
person_session
+--------+---------+
| person | session |
+--------+---------+
| Joe | 103 |
| Joe | 104 |
| Mary | 101 |
| Mary | 102 |
| Frank | 101 |
| Sasha | 101 |
+--------+---------+

The regexp_replace function strips whitespace so that each session id won't include any spaces. The string_to_array function then turns that string value into an array. These steps will be familiar to everyone whose used Python, or another language, to clean and reshape data.

If you're coming from Python, you can read regexp_replace as re.sub and string_to_array as split. The unnest function, though, will be less familiar. It projects a Postgres array to rows, so we end up with multiple rows for Joe and Mary.

Given this setup, here's a control that checks for valid sessions in the sessions table. We use select distinct because we don't need to check duplicate occurrences of a session.

control "sessions_valid_in_people_table" {
title = "Sessions in the people table are valid"
sql = <<EOT
with person_session as (
select
person,
string_to_array(
regexp_replace(sessions, '\s', '', 'g'),
','
) as sessions
from csv.people
),
unnested as (
select distinct
unnest(sessions) as session
from
person_session
)
select
'event planner' as resource,
case
when (
session = any(array['101','102','103','104','105','201','202','203','204','205'])
) then 'ok'
else 'alarm'
end as status,
'Session ' || session as reason
from unnested
EOT
}

We can now run either of these controls individually, using steampipe check control.sessions_valid_in_sessions_table or steampipe check control.sessions_valid_in_people_table. Or we can simply use steampipe check all to run all the control in this workspace and report all the results with a combined summary.

~/event_planning$ steampipe check all

Of course we'd rather not write the list of sessions twice. So we'll define them using a variable declared at the top of event_planning.sp.

variable "valid_sessions" {
type = list(string)
default = ["101","102","103","104","105","201","202","203","204","205"]
}

The values are double-quoted, JSON-style. Now we can refer to that variable from each of the controls. Here's the first one, revised accordingly.

control "sessions_valid_in_session_table" {
title = "Sessions in the sessions table are valid"
sql = <<EOT
select
'event planner' as resource,
case
when (
session = any( $1 )
) then 'ok'
else 'alarm'
end as status,
'Session ' || session as reason
from
csv.sessions
EOT
param "valid_sessions" {
default = var.valid_sessions
}
}

There can be more than one param argument corresponding to more than one variable. In that case, the order of the variable declarations does not matter, but the order of the param arguments does. If valid_sessions were a second param argument then the placeholder for its interpolation into the query would be $2.

Alternative outputs

Along with dynamic schemas, the 0.9 release added support for two new flavors of control output: HTML and Markdown. To produce HTML output:

steampipe check all --export=event_planning.html

Conclusion

Dynamic schemas vastly expand the universe of data available to Steampipe. The CSV plugin was the first to exploit the new capability, but already there's another. With the AirTable plugin, contributed by François de Metz, you can bring tables from that system into Steampipe.

If there weren't an AirTable plugin, of course, you could export to CSV and use the CSV plugin. CSV is truly the lingua franca of data exchange; all kinds of systems and apps can export it; the folks at GitTables are annotating and studying 1.7 million tables extracted from CSV files in GitHub!

You can use CSV-sourced data to augment queries and controls that work with Steampipe's growing collection of plugins. But as we've seen here, you can also write simple controls to check the integrity of critical data that's managed, for better and worse, in spreadsheets. Our Steampipe samples repo has the ingredients shown here. Install the CSV plugin, run the controls, then show us how you've used this method to check the integrity of one of your spreadsheets!