We have a love/hate relationship with our spreadsheets. We love them because they're convenient, and we hate them because they're uncontrolled. With the 0.9 release of Steampipe we can now apply integrity checks to our spreadsheet data. The new CSV plugin is the enabler for this magic. Building on dynamic schema support in Steampipe, the CSV plugin is the first of a new kind of plugin that maps data to tables without requiring predefined schemas.
Fixed vs dynamic schemas
Previously, every Steampipe plugin worked with a fixed schema. The Slack plugin, for example, maps Slack's REST API to tables using definitions from the golang API wrapper for Slack. That wrapper defines a User with a fixed set of fields: ID, TeamID, Name, etc. The Slack plugin maps those fields to the columns you see when you select * from slack_user.
The CSV plugin doesn't need such fixed schemas: it builds them on the fly. That means you can acquire tables from any source that exports data to CSV files. You can query these separately; you can join across them; and you can join CSV-sourced tables with fixed-schema tables provided by other plugins.
Granular resource tagging
Suppose you're using Steampipe to check resource tags against a vocabulary of mandatory tags. In our 0.8 release we introduced a way to specify that list of tags when running one of the tagging control mods for AWS, Azure, or GCP. In the AWS context, for example, you can do this.
steampipe check benchmark.mandatory --var 'mandatory_tags=["Application", "Environment", "Department", "Owner"]'
That command runs 71 controls, each of which interpolates the list of tags passed on the command line and checks each of the associated resources for those tags.
What if you want to augment the general rule with checks for specific tags associated with specific resources? You can now record a mapping from resources to resource-specific tags in a spreadsheet, export it to CSV, and write new controls that join against the mapping table defined in that CSV file.
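For example, here's a minimal sketch of such a join, written as a plain query rather than a full control. The resource_tags.csv file and its arn and required_tag columns are hypothetical; the aws_s3_bucket table comes from the AWS plugin, which exposes tags as a JSONB column.

  -- Hypothetical mapping file: resource_tags.csv with columns arn, required_tag.
  -- For each mapped bucket, check whether the required tag key is present.
  select
    b.arn,
    m.required_tag,
    case
      when b.tags ? m.required_tag then 'ok'
      else 'alarm'
    end as status
  from
    csv.resource_tags as m
    join aws_s3_bucket as b on b.arn = m.arn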
Integrity checks for spreadsheets
Beyond augmenting existing Steampipe queries and controls, you can now write new ones to explore and validate the data in critical spreadsheets. For example, here's a common scenario: an event planner uses a spreadsheet to keep track of people and sessions. The spreadsheet has two tabs like so.
person,sessions
Mary,"101, 102"
Frank,101
Joe,"103, 104"
Sasha,101
session,time
100,11/18/2021 9:00
101,11/18/2021 9:00
102,11/18/2021 9:00
201,11/18/2021 10:00
202,11/18/2021 10:00
To work with this data in Steampipe, export the spreadsheet tabs to files: people.csv and sessions.csv. Then install the CSV plugin (steampipe plugin install csv), edit ~/.steampipe/config/csv.spc, and set paths to match the location of the exported files (in this case, ~/*.csv).
connection "csv" {plugin = "csv"paths = [ "~/*.csv" ]}
Now, when you run steampipe query, those files magically appear as tables in the csv namespace.
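For example, assuming the two exports above are in place, you can query the people table directly:

  select
    person,
    sessions
  from
    csv.people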
Writing custom controls to check data integrity
As is typical in such scenarios, nothing guarantees that the session numbers are valid. Let's say that the valid sessions are 101-105 and 201-205. Here is a control that checks that condition.
control "sessions_valid_in_session_table" {title = "Sessions in the sessions table are valid"sql = <<EOTselect'event planner' as resource,casewhen (session = any(array['101','102','103','104','105','201','202','203','204','205'])) then 'ok'else 'alarm'end as status,'Sessions in the sessions table are valid' as reasonfromcsv.sessionsEOT}
To "deploy" this control we'll create a directory, event_planning
, and in it place a file, event_planning.sp
, containing the above code. This directory is implicitly a workspace in which Steampipe can run both named queries and mods. Here is our definition of a mod:
A Steampipe mod is a portable, versioned collection of related Steampipe resources such as queries, controls, and benchmarks. Steampipe mods and mod resources are defined in HCL, and distributed as simple text files. Modules can be found on the Steampipe Hub, and may be shared with others from any public git repository.
Mods can be very elaborate. The AWS compliance mod, for example, comprises 10 benchmarks that encapsulate 433 controls and 224 named queries.
But they can also be very simple. If our event_planning.sp file contains only the code shown above, we can do this to find invalid sessions in the sessions table.
~/event_planning$ steampipe check all
Checking a more complex table
We also want to know if the sessions mentioned in the people table are valid. In order to do that, we'll need to project each list of sessions to a set of rows.
with person_session as (
  select
    person,
    string_to_array(regexp_replace(sessions, '\s', '', 'g'), ',') as sessions
  from csv.people
)
select
  person,
  unnest(sessions) as session
from
  person_session
+--------+---------+
| person | session |
+--------+---------+
| Joe    | 103     |
| Joe    | 104     |
| Mary   | 101     |
| Mary   | 102     |
| Frank  | 101     |
| Sasha  | 101     |
+--------+---------+
The regexp_replace function strips whitespace so that each session id won't include any spaces. The string_to_array function then turns that string value into an array. These steps will be familiar to everyone who's used Python, or another language, to clean and reshape data.
If you're coming from Python, you can read regexp_replace as re.sub and string_to_array as split. The unnest function, though, will be less familiar. It projects a Postgres array to rows, so we end up with multiple rows for Joe and Mary.
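If you'd like to see unnest on its own, here's a minimal example against a literal array, unrelated to the CSV data:

  -- unnest projects a Postgres array into one row per element
  select
    unnest(array['101', '102', '103']) as session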
Given this setup, here's a control that checks for valid sessions in the people table. We use select distinct because we don't need to check duplicate occurrences of a session.
control "sessions_valid_in_people_table" {title = "Sessions in the people table are valid"sql = <<EOTwith person_session as (selectperson,string_to_array(regexp_replace(sessions, '\s', '', 'g'),',') as sessionsfrom csv.people),unnested as (select distinctunnest(sessions) as sessionfromperson_session)select'event planner' as resource,casewhen (session = any(array['101','102','103','104','105','201','202','203','204','205'])) then 'ok'else 'alarm'end as status,'Session ' || session as reasonfrom unnestedEOT}
We can now run either of these controls individually, using steampipe check control.sessions_valid_in_session_table or steampipe check control.sessions_valid_in_people_table. Or we can simply use steampipe check all to run all the controls in this workspace and report all the results with a combined summary.
~/event_planning$ steampipe check all
Of course we'd rather not write the list of sessions twice. So we'll define them using a variable declared at the top of event_planning.sp.
variable "valid_sessions" {type = list(string)default = ["101","102","103","104","105","201","202","203","204","205"]}
The values are double-quoted, JSON-style. Now we can refer to that variable from each of the controls. Here's the first one, revised accordingly.
control "sessions_valid_in_session_table" {title = "Sessions in the sessions table are valid"sql = <<EOTselect'event planner' as resource,casewhen (session = any( $1 )) then 'ok'else 'alarm'end as status,'Session ' || session as reasonfromcsv.sessionsEOTparam "valid_sessions" {default = var.valid_sessions}}
There can be more than one param argument corresponding to more than one variable. In that case, the order of the variable declarations does not matter, but the order of the param arguments does. If valid_sessions were the second param argument, then the placeholder for its interpolation into the query would be $2.
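To make that concrete, here's a hypothetical sketch of a control with two param blocks. Neither this control nor the valid_people list is part of the sample mod; the point is only that the first param block binds to $1 and the second to $2.

control "sessions_and_people_valid" {
  title = "Sessions and people in the people table are valid"
  sql = <<EOT
    with person_session as (
      select
        person,
        unnest(string_to_array(regexp_replace(sessions, '\s', '', 'g'), ',')) as session
      from csv.people
    )
    select
      'event planner' as resource,
      case
        when session = any( $1 ) and person = any( $2 ) then 'ok'
        else 'alarm'
      end as status,
      person || ': ' || session as reason
    from person_session
EOT
  param "valid_sessions" {
    default = var.valid_sessions
  }
  param "valid_people" {
    default = ["Mary", "Frank", "Joe", "Sasha"]
  }
}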
Alternative outputs
Along with dynamic schemas, the 0.9 release added support for two new flavors of control output: HTML and Markdown. To produce HTML output:
steampipe check all --export=event_planning.html
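Markdown works the same way, with the format inferred from the file extension:

steampipe check all --export=event_planning.md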
Conclusion
Dynamic schemas vastly expand the universe of data available to Steampipe. The CSV plugin was the first to exploit the new capability, but already there's another. With the AirTable plugin, contributed by François de Metz, you can bring tables from that system into Steampipe.
If there weren't an AirTable plugin, of course, you could export to CSV and use the CSV plugin. CSV is truly the lingua franca of data exchange; all kinds of systems and apps can export it; the folks at GitTables are annotating and studying 1.7 million tables extracted from CSV files in GitHub!
You can use CSV-sourced data to augment queries and controls that work with Steampipe's growing collection of plugins. But as we've seen here, you can also write simple controls to check the integrity of critical data that's managed, for better or worse, in spreadsheets. Our Steampipe samples repo has the ingredients shown here. Install the CSV plugin, run the controls, then show us how you've used this method to check the integrity of one of your spreadsheets!