The Millhouse Group Blog: In praise of ETL, part two; Trouble-free Transforms

Saturday, 6 May 2023

In praise of ETL, part two; Trouble-free Transforms

Continuing with my series about the unexpected pleasures of ETL, we get to the Transform stage.

I've already mentioned in the previous post the huge benefit of separating the extraction and transformation stages, giving an almost-limitless source of test fixtures for transform unit tests. On the subject of testing, it strikes me that there's a parallel in ETL with the classic "cost of failure" model that is typically used to justify adoption of unit tests to dinosaur-era organisations that don't use them; viz:

(Graph from DeepSource.com)

My contention is that failure in each E/T/L stage has a similar cost profile (of course YMMV, but it applied in our customer-migration scenario);

An error during Extract

(Assuming an absolute minimum of logic exists in the extraction code)
Most likely due to a source system being overloaded/overwhelmed by sheer number of extraction requests occurring at once
Throttle them, and retry the process
Easy and cheap to restart
Overall INEXPENSIVE

An error during Transform

Easy to reproduce via unit tests/fixtures
Rebuild, redeploy, re-run
Overall MEDIUM EXPENSE

An error during Load

Most likely in "somebody else's code"
Investigation/rectification may require cross-functional/cross-team/cross-company communications
Re-run for this scenario may be blocked if target system needs to be cleared down
Overall HIGH EXPENSE

Thus it behooves us (such a great vintage phrase that!) to get our transforms nice and tight; heavy unit-testing is an obvious solution here but also careful consideration of what the approach to transforming questionable data should be. In our case, our initial transform attempts took the pragmatic, Postel-esque "accept garbage, don't throw an error, return something sensible" approach. So upon encountering invalid data for example, we'd log a warning, and transform it to an undefined object or empty array as appropriate.

This turned out to be a problem, as we weren't getting enough feedback about the sheer amount of bad input data we were simply skimming over, resulting in gaps in the data being loaded into the new system.

So in the next phase of development, we became willingly, brutally "fragile", throwing an error as soon as we encountered input data that wasn't ideal. This would obviously result in a lot of failed ETL jobs, but it alerted us to the problems which we could then mitigate in the source system or with code fixes (and unit tests) as needed.

Interestingly, it turned out that in the "long tail" of the customer migration project, we had to return back (somewhat) to the "permissive mode" in order to get particularly difficult customer accounts to be migrated. The approach at that point was to migrate them with known holes in their data, and fix them in the TARGET system.

Here's my crude visualisation of it. I don't know if this mode of code evolution has a name but I found it interesting.

The Millhouse Group Blog

Saturday, 6 May 2023

In praise of ETL, part two; Trouble-free Transforms

An error during Extract

An error during Transform

An error during Load

No comments:

Post a Comment