From July 23, 2022, at 10 PM, to July 24 at 5 PM, Pix was in maintenance mode. The reason? We took advantage of the quiet summer period to migrate the primary identifier of one of the tables in our database from int to bigint. This follows the previous migration last August.
In this kind of situation everyone must be aligned. To do the migration we had to put all the impacted applications into maintenance mode, make sure the database was in the right state, launch the migration script, wait (a long time), check for correct results, and relaunch the applications.
All this requires good tools, and we have them in abundance. Our kit includes FreshPing for website monitoring, Datadog for logs and metrics, Scalingo for hosting, and Confluence for documentation.
But not everyone on staff was fully comfortable with this suite of tools. How could we coordinate their use and present a unified view of the migration? For that we turned to Steampipe. We'd already been running daily queries to check the status of production systems; this would be our first use of Steampipe dashboards. Our migration dashboard would be the control tower from which we'd track the chronology of:
Putting impacted Pix applications into maintenance mode
Performing the migration
Reopening
The dashboard code is, like many things at Pix, available on GitHub.
Red, green
So what does it look like? In step one, everything is red.
Once all the impacted systems are paused or shut down, the icons go green. (Did you notice the pretty icons?)
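Each of those icons is a dashboard card, and the card's color comes from the query behind it: return a type column with the value ok and the card renders green, return alert and it renders red. Here's a minimal sketch of such a maintenance check using the Net plugin's net_http_request table; the URL is only an example, not necessarily one of the endpoints we actually watched.

```sql
-- Maintenance check for one front end: green ("ok") when the app answers 503,
-- red ("alert") otherwise. The example URL is a placeholder.
select
  'app.pix.fr' as label,
  status_code as value,
  case when status_code = 503 then 'ok' else 'alert' end as type
from
  net_http_request
where
  url = 'https://app.pix.fr';
```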
Now everything is off, but we still need to check for applications that might have open connections.
At this point there are still a few open connections; we have to close them before proceeding. Once that's done, we can start the migration process.
After a few hours, are we there? Not quite.
A few hours more? Oui!
Under the hood
As we do for everything else, we deployed Steampipe to Scalingo using our Steampipe buildpack. Because we were manipulating production information, we deployed to the osc-secnum-fr1 region, which is certified by ANSSI, France's national cybersecurity agency.
The maintenance checks used the Net plugin to verify that applications returned the expected HTTP 503 status. For those not directly accessible from the internet, we used the Scalingo plugin to check that all containers were stopped. To verify that monitoring was off, we used the FreshPing plugin.
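For the apps that aren't reachable from the internet, the equivalent check asks Scalingo how many containers are still running. A rough sketch, assuming the Scalingo plugin exposes a scalingo_container table with app_name, name, and amount columns (check the plugin's documentation for the exact schema); the app name is made up.

```sql
-- Hypothetical check: an app is fully stopped when every container type is scaled to zero.
-- Table and column names are assumptions; the app name is a placeholder.
select
  app_name,
  name,
  amount,
  case when amount = 0 then 'ok' else 'alert' end as type
from
  scalingo_container
where
  app_name = 'pix-api-production';
```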
Before starting the migration, we had open connections to PostgreSQL. To make sure those were closed, we used the Datadog plugin, which retrieved the information from the logs of an internal service called pix-db-stats.
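A sketch of what that check can look like, assuming the Datadog plugin's datadog_log_event table and its query filter; the query string and time window are illustrative, and the real log attributes will differ.

```sql
-- Recent log lines from the service that reports open PostgreSQL connections.
-- The query string and time window are placeholders.
select
  timestamp,
  message
from
  datadog_log_event
where
  query = 'service:pix-db-stats'
  and timestamp > now() - interval '10 minutes'
order by
  timestamp desc
limit 20;
```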
PostgreSQL did the heavy lifting, and there was no way to track its internal progress directly. But we were able to monitor the phases of the migration script by using the Datadog plugin to read container logs, then infer and display the current step.
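The same idea, pushed one step further, can turn the latest matching log line into a named phase for the dashboard. Everything below is illustrative: the query string, the message patterns, and the step labels are assumptions, not the actual output of our migration script.

```sql
-- Derive the current migration phase from the most recent matching log line.
-- Query string, message patterns, and step labels are placeholders.
select
  case
    when message like '%add bigint column%' then 'Step 1: add new column'
    when message like '%backfill%' then 'Step 2: backfill'
    when message like '%swap primary key%' then 'Step 3: swap primary key'
    else 'In progress'
  end as current_step
from
  datadog_log_event
where
  query = 'service:pix-db-migration'
order by
  timestamp desc
limit 1;
```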
Conclusion
The migration went well, and we're glad to have deepened our understanding of Steampipe dashboards. We already knew how to query diverse services and combine the results. Now we also know how to present those results in a nice interface that's easy to start building and then iteratively enhance.
Things could still be improved. The monitoring of the migration was a bit fragile, and some phases weren't visible to Steampipe, notably the shutdown of the data infrastructure on OVH. We don't expect to do this kind of migration often, but when it happens again, the experience we gained this time will enable us to make even better use of Steampipe as our migration control tower.