
Sunday, 25 June 2023
In praise of ETL, part three: Love-you-Load-time
Finishing off my three-part series as a newly-minted ETL fanboi, we get to the Load stage. One could be forgiven for thinking there's not a great deal to get excited about at this point; throw data at the new system and it's Job Done. But as usual with ETL, there are hidden pleasures lurking that you might not have considered until you've got your feet wet.
Mechanical Sympathy
In the "customer migration" ETL project I worked on, the output of the Transform stage for a given customer was dumped into a JSON file in an AWS S3 bucket. At Load time, the contents of the bucket were scooped out and fed into the target system. Something we noticed quite quickly as the migration project ramped up was that the new system did not "like" being hit with too many "create new customer" API calls per second. It was pretty simple to implement a rate limit in the Load stage (only!) to ensure we were being mechanically sympathetic to the new system, while still being able to go as fast as possible in the other stages.
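To make the idea concrete, here's a minimal sketch of a Load-stage rate limiter as a token bucket. All names here are illustrative, not our actual project code; the point is just that each "create new customer" call has to acquire a token before hitting the target API, so the load rate never exceeds what the new system can absorb.

```javascript
// Illustrative token bucket: refills at ratePerSecond, each Load call
// must take a token before calling the target system's API.
class TokenBucket {
  constructor(ratePerSecond, capacity = ratePerSecond) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerMs = ratePerSecond / 1000;
    this.lastRefill = Date.now();
  }

  refill(now = Date.now()) {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = now;
  }

  // Try to take one token; returns true if the caller may proceed now.
  tryAcquire(now = Date.now()) {
    this.refill(now);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Wait (poll) until a token is available.
  async acquire() {
    while (!this.tryAcquire()) {
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
  }
}

// Usage sketch - throttling the "create new customer" calls:
// const bucket = new TokenBucket(5); // say the target copes with ~5/sec
// async function loadCustomer(payload) {
//   await bucket.acquire();
//   await targetApi.createCustomer(payload); // hypothetical API client
// }
```

The nice property, as noted above, is that this throttle lives in the Load stage only; Extract and Transform can run as fast as their own systems allow.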
Optimal Throughput
Indeed, we had a similar rate limit in our Extract stage for the benefit of our source system(s) - albeit at a higher rate, as its API seemed to be able to handle reading a fair bit faster than the new system's API could write. And there's another benefit - we weren't throttled by the speed of the slower system; we could still extract as fast as the source would allow, transform and buffer into S3, then load at the optimal speed for the new system. You could get fancy and call it Elastic Scaling or somesuch, but really, if we'd used some monolithic process to do these customer migrations, we wouldn't have had this fine-grained control.
Idempotency is Imperative
One last tip: strive to ensure your Load stage does not alter the output of the Transform in any way, or you'll lose one of the key advantages of the whole ETL architecture. If you can't look at a transform file (e.g. a JSON blob in an S3 bucket, in our case) and know that it's exactly what was sent to the target system, then your debugging just got a whole lot harder. Even something as innocent as populating a createdAt field with new Date() could well bite you (for example, if the date has to be in a particular format). If you've got to do something like that, consider passing the date in, in the correct format, as an additional parameter to the Load stage, so there's at least some evidence of what the field was actually set to. There's really nothing worse than not being able to say with confidence what you actually sent to the target system.
We didn't do this, but if there was a "next time" I'd also store a copy of this payload in an S3 bucket as well, just for quick verification purposes.
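The "pass the date in" tip can be sketched in a few lines. The field names and shapes below are hypothetical, not our actual payloads; the point is that the Load stage never mutates the transform output, and anything it must add (like createdAt) arrives pre-formatted as an explicit parameter, leaving a record of exactly what was sent.

```javascript
// Illustrative: combine the (unmodified) transform output with values
// the caller supplies explicitly, rather than computing them inside Load.
function buildLoadPayload(transformOutput, createdAt) {
  // Spread copies the transform output without mutating it, so the
  // original S3 file still matches what was sent (plus the logged extras).
  return { ...transformOutput, createdAt };
}

// Usage sketch: the date is formatted once, up front, by the caller -
// no hidden new Date() buried inside the Load stage.
const transformOutput = { customerId: 'c-123', name: 'Example Customer' }; // hypothetical
const payload = buildLoadPayload(transformOutput, '2023-06-25T00:00:00Z');
```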
Saturday, 6 May 2023
In praise of ETL, part two; Trouble-free Transforms
Continuing with my series about the unexpected pleasures of ETL, we get to the Transform stage.
I've already mentioned in the previous post the huge benefit of separating the extraction and transformation stages, giving an almost-limitless source of test fixtures for transform unit tests. On the subject of testing, it strikes me that there's a parallel in ETL with the classic "cost of failure" model that is typically used to justify adoption of unit tests to dinosaur-era organisations that don't use them; viz:
(Graph from DeepSource.com.) My contention is that failure in each E/T/L stage has a similar cost profile (of course YMMV, but it applied in our customer-migration scenario):
An error during Extract
- (Assuming an absolute minimum of logic exists in the extraction code)
- Most likely due to a source system being overloaded/overwhelmed by sheer number of extraction requests occurring at once
- Throttle them, and retry the process
- Easy and cheap to restart
- Overall INEXPENSIVE
An error during Transform
- Easy to reproduce via unit tests/fixtures
- Rebuild, redeploy, re-run
- Overall MEDIUM EXPENSE
An error during Load
- Most likely in "somebody else's code"
- Investigation/rectification may require cross-functional/cross-team/cross-company communications
- Re-run for this scenario may be blocked if target system needs to be cleared down
- Overall HIGH EXPENSE
Thus it behooves us (such a great vintage phrase, that!) to get our transforms nice and tight; heavy unit testing is an obvious solution here, but so is careful consideration of how the transform should handle questionable data. In our case, our initial transform attempts took the pragmatic, Postel-esque "accept garbage, don't throw an error, return something sensible" approach. So upon encountering invalid data, for example, we'd log a warning and transform it to an undefined object or empty array as appropriate.
This turned out to be a problem, as we weren't getting enough feedback about the sheer amount of bad input data we were simply skimming over, resulting in gaps in the data being loaded into the new system.
So in the next phase of development, we became willingly, brutally "fragile", throwing an error as soon as we encountered input data that wasn't ideal. This would obviously result in a lot of failed ETL jobs, but it alerted us to the problems which we could then mitigate in the source system or with code fixes (and unit tests) as needed.
Interestingly, it turned out that in the "long tail" of the customer migration project, we had to go back (somewhat) to the "permissive mode" in order to get particularly difficult customer accounts migrated. The approach at that point was to migrate them with known holes in their data, and fix them up in the TARGET system.
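The two modes could be captured with something as simple as a strict flag on each transform function. This is an illustrative sketch (the field and its validation rule are made up), not our actual code, but it shows the shape of the permissive-then-fragile-then-permissive evolution:

```javascript
// Illustrative transform for a single field, switchable between the
// "brutally fragile" mode (throw on bad input, fail the ETL job loudly)
// and the Postel-esque permissive mode (warn and carry on with a hole).
function transformEmail(raw, { strict = true } = {}) {
  const valid = typeof raw === 'string' && raw.includes('@');
  if (valid) return raw.trim().toLowerCase();

  if (strict) {
    // Fragile mode: surface the bad source data immediately.
    throw new Error(`Invalid email in source data: ${JSON.stringify(raw)}`);
  }

  // Permissive mode: log a warning and return a known "hole".
  console.warn('Skipping invalid email:', raw);
  return undefined;
}
```

For the long-tail accounts, flipping `strict` off let the migration proceed with known gaps, to be patched in the target system afterwards.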
Here's my crude visualisation of it. I don't know if this mode of code evolution has a name but I found it interesting.
Saturday, 25 March 2023
In praise of ETL, part one; E's are good
I've written previously about how at work I've been using an ETL (Extract, Transform, Load) process for customer migrations, but that was mostly in the context of a use case for AWS Step Functions.
Now I want to talk about ETL itself, and how good it's been as an approach. It's been around for a while so one would expect it to have merits, but I've found some aspects to be particularly neat and wanted to call them out specifically. So here we go.
An Extract is a perfect test fixture
I'd never realised this before, but the very act of storing the data you plan on Transforming and Loading is tremendously powerful. Firstly, it lets you see exactly what data your Transform was acting upon; secondly, it gives you replay-ability using that exact data (if that's what you want/need); and thirdly, you've got an instant source of test fixture data for checking how your transform code handles that one weird bug you just came across in production.
My workflow for fixing transform-stage bugs literally became:
- Locate JSON extract file for the process that failed
- Save as local JSON file in test fixtures directory of the transform code
- Write a test to attempt to transform this fixture (or sub-component of it)
- Test should fail as the production code does
- Fix transform code, test should now pass
- Commit fixed code, new test(s) and fixture
- Release to production
- Re-run ETL process; bug is gone