
Sunday, 25 June 2023
In praise of ETL, part three: Love-you-Load-time
Finishing off my three-part series as a newly-minted ETL fanboi, we get to the Load stage. One could be forgiven for thinking there's not a great deal to get excited about at this point; throw data at the new system and it's Job Done. But as usual with ETL, there are hidden pleasures lurking that you might not have considered until you've got your feet wet.
Mechanical Sympathy
In the "customer migration" ETL project I worked on, the output of the Transform stage for a given customer was dumped into a JSON file in an AWS S3 bucket. At Load time, the contents of the bucket were scooped out and fed into the target system. Something we noticed quite quickly as the migration project ramped up was that the new system did not "like" being hit with too many "create new customer" API calls per second. It was pretty simple to implement a rate limit in the Load stage (only!) to ensure we were being mechanically sympathetic to the new system, while still being able to go as fast as possible in the other stages.
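To make the idea concrete, here's a minimal sketch of a Load-stage rate limiter as a token bucket. All names here are illustrative, not our actual project code; the point is just that each "create new customer" call has to acquire a token before hitting the target API, so the load rate never exceeds what the new system can absorb.

```javascript
// Illustrative token bucket: refills at ratePerSecond, each Load call
// must take a token before calling the target system's API.
class TokenBucket {
  constructor(ratePerSecond, capacity = ratePerSecond) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerMs = ratePerSecond / 1000;
    this.lastRefill = Date.now();
  }

  refill(now = Date.now()) {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = now;
  }

  // Try to take one token; returns true if the caller may proceed now.
  tryAcquire(now = Date.now()) {
    this.refill(now);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Wait (poll) until a token is available.
  async acquire() {
    while (!this.tryAcquire()) {
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
  }
}

// Usage sketch - throttling the "create new customer" calls:
// const bucket = new TokenBucket(5); // say the target copes with ~5/sec
// async function loadCustomer(payload) {
//   await bucket.acquire();
//   await targetApi.createCustomer(payload); // hypothetical API client
// }
```

The nice property, as noted above, is that this throttle lives in the Load stage only; Extract and Transform can run as fast as their own systems allow.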
Optimal Throughput
Indeed, we had a similar rate limit in our Extract stage for the benefit of our source system(s) - albeit at a higher rate, as its API seemed to be able to handle reading a fair bit faster than the new system's API could write. And there's another benefit - we weren't throttled by the speed of the slower system; we could still extract as fast as the source would allow, transform and buffer into S3, then load at the optimal speed for the new system. You could get fancy and call it Elastic Scaling or somesuch, but really, if we'd used some monolithic process to do these customer migrations, we wouldn't have had this fine-grained control.
Idempotency is Imperative
One last tip: strive to ensure your Load stage does not alter the output of the Transform in any way, or you'll lose one of the key advantages of the whole ETL architecture. If you can't look at a transform file (e.g. a JSON blob in an S3 bucket, in our case) and know that it's exactly what was sent to the target system, then your debugging just got a whole lot harder. Even something as innocent as populating a createdAt field with new Date() could well bite you (for example, if the date has to be in a particular format). If you've got to do something like that, consider passing the date in, in the correct format, as an additional parameter to the Load stage, so there's at least some evidence of what the field was actually set to. There's really nothing worse than not being able to say with confidence what you actually sent to the target system.
We didn't do this, but if there was a "next time" I'd also store a copy of this payload in an S3 bucket as well, just for quick verification purposes.
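The "pass the date in" tip can be sketched in a few lines. The field names and shapes below are hypothetical, not our actual payloads; the point is that the Load stage never mutates the transform output, and anything it must add (like createdAt) arrives pre-formatted as an explicit parameter, leaving a record of exactly what was sent.

```javascript
// Illustrative: combine the (unmodified) transform output with values
// the caller supplies explicitly, rather than computing them inside Load.
function buildLoadPayload(transformOutput, createdAt) {
  // Spread copies the transform output without mutating it, so the
  // original S3 file still matches what was sent (plus the logged extras).
  return { ...transformOutput, createdAt };
}

// Usage sketch: the date is formatted once, up front, by the caller -
// no hidden new Date() buried inside the Load stage.
const transformOutput = { customerId: 'c-123', name: 'Example Customer' }; // hypothetical
const payload = buildLoadPayload(transformOutput, '2023-06-25T00:00:00Z');
```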
Saturday, 6 May 2023
In praise of ETL, part two; Trouble-free Transforms
Continuing with my series about the unexpected pleasures of ETL, we get to the Transform stage.
I've already mentioned in the previous post the huge benefit of separating the extraction and transformation stages, giving an almost-limitless source of test fixtures for transform unit tests. On the subject of testing, it strikes me that there's a parallel in ETL with the classic "cost of failure" model that is typically used to justify adoption of unit tests to dinosaur-era organisations that don't use them; viz:
(Graph from DeepSource.com.) My contention is that failure in each E/T/L stage has a similar cost profile (of course YMMV, but it applied in our customer-migration scenario):
An error during Extract
- (Assuming an absolute minimum of logic exists in the extraction code)
- Most likely due to a source system being overloaded/overwhelmed by sheer number of extraction requests occurring at once
- Throttle them, and retry the process
- Easy and cheap to restart
- Overall INEXPENSIVE
An error during Transform
- Easy to reproduce via unit tests/fixtures
- Rebuild, redeploy, re-run
- Overall MEDIUM EXPENSE
An error during Load
- Most likely in "somebody else's code"
- Investigation/rectification may require cross-functional/cross-team/cross-company communications
- Re-run for this scenario may be blocked if target system needs to be cleared down
- Overall HIGH EXPENSE
Thus it behooves us (such a great vintage phrase, that!) to get our transforms nice and tight; heavy unit testing is an obvious solution here, but so is careful consideration of how the transform should handle questionable data. In our case, our initial transform attempts took the pragmatic, Postel-esque "accept garbage, don't throw an error, return something sensible" approach. So upon encountering invalid data, for example, we'd log a warning and transform it to an undefined object or empty array as appropriate.
This turned out to be a problem, as we weren't getting enough feedback about the sheer amount of bad input data we were simply skimming over, resulting in gaps in the data being loaded into the new system.
So in the next phase of development, we became willingly, brutally "fragile", throwing an error as soon as we encountered input data that wasn't ideal. This would obviously result in a lot of failed ETL jobs, but it alerted us to the problems which we could then mitigate in the source system or with code fixes (and unit tests) as needed.
Interestingly, it turned out that in the "long tail" of the customer migration project, we had to go back (somewhat) to the "permissive mode" in order to get particularly difficult customer accounts migrated. The approach at that point was to migrate them with known holes in their data, and fix them up in the TARGET system.
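The two modes could be captured with something as simple as a strict flag on each transform function. This is an illustrative sketch (the field and its validation rule are made up), not our actual code, but it shows the shape of the permissive-then-fragile-then-permissive evolution:

```javascript
// Illustrative transform for a single field, switchable between the
// "brutally fragile" mode (throw on bad input, fail the ETL job loudly)
// and the Postel-esque permissive mode (warn and carry on with a hole).
function transformEmail(raw, { strict = true } = {}) {
  const valid = typeof raw === 'string' && raw.includes('@');
  if (valid) return raw.trim().toLowerCase();

  if (strict) {
    // Fragile mode: surface the bad source data immediately.
    throw new Error(`Invalid email in source data: ${JSON.stringify(raw)}`);
  }

  // Permissive mode: log a warning and return a known "hole".
  console.warn('Skipping invalid email:', raw);
  return undefined;
}
```

For the long-tail accounts, flipping `strict` off let the migration proceed with known gaps, to be patched in the target system afterwards.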
Here's my crude visualisation of it. I don't know if this mode of code evolution has a name but I found it interesting.
Saturday, 25 March 2023
In praise of ETL, part one; E's are good
I've written previously about how at work I've been using an ETL (Extract, Transform, Load) process for customer migrations, but that was mostly in the context of a use case for AWS Step Functions.
Now I want to talk about ETL itself, and how good it's been as an approach. It's been around for a while so one would expect it to have merits, but I've found some aspects to be particularly neat and wanted to call them out specifically. So here we go.
An Extract is a perfect test fixture
I'd never realised this before, but the very act of storing the data you plan on Transforming and Loading is tremendously powerful. Firstly, it lets you see exactly what data your Transform was acting upon; secondly, it gives you replay-ability using that exact data (if that's what you want/need); and thirdly, you've got an instant source of test fixture data for checking how your transform code handles that one weird bug you just came across in production.
My workflow for fixing transform-stage bugs literally became:
- Locate JSON extract file for the process that failed
- Save as local JSON file in test fixtures directory of the transform code
- Write a test to attempt to transform this fixture (or sub-component of it)
- Test should fail as the production code does
- Fix transform code, test should now pass
- Commit fixed code, new test(s) and fixture
- Release to production
- Re-run ETL process; bug is gone