Saturday, 25 March 2023

In praise of ETL, part one; E's are good

I've written previously about how at work I've been using an ETL (Extract, Transform, Load) process for customer migrations, but that was mostly in the context of a use case for AWS Step Functions.

Now I want to talk about ETL itself, and how good it's been as an approach. It's been around for a while so one would expect it to have merits, but I've found some aspects to be particularly neat and wanted to call them out specifically. So here we go.

An Extract is a perfect test fixture

I'd never realised this before, but the very act of storing the data you plan on Transforming, and Loading, is tremendously powerful. Firstly, it lets you see exactly what data your Transform was acting upon; secondly, it gives you replay-ability using that exact data (if that's what you want/need) and thirdly, you've got an instant source of test fixture data for checking how your transform code handles that one weird bug that you just came across in production.

My workflow for fixing transform-stage bugs literally became:

  • Locate JSON extract file for the process that failed
  • Save as local JSON file in test fixtures directory of the transform code
  • Write a test to attempt to transform this fixture (or sub-component of it)
  • Test should fail as the production code does
  • Fix transform code, test should now pass
  • Commit fixed code, new test(s) and fixture
  • Release to production
  • Re-run ETL process; bug is gone