Saturday, 6 May 2023

In praise of ETL, part two; Trouble-free Transforms

Continuing with my series about the unexpected pleasures of ETL, we get to the Transform stage.

I've already mentioned in the previous post the huge benefit of separating the extraction and transformation stages: an almost-limitless source of test fixtures for transform unit tests. On the subject of testing, it strikes me that there's a parallel between ETL and the classic "cost of failure" model that is typically used to justify the adoption of unit tests to dinosaur-era organisations that don't use them; viz:

(Graph from DeepSource.com)

My contention is that failures in the E, T and L stages follow a similarly escalating cost profile (of course YMMV, but it held true in our customer-migration scenario):

An error during Extract
  • (Assuming an absolute minimum of logic exists in the extraction code)
  • Most likely due to a source system being overloaded/overwhelmed by sheer number of extraction requests occurring at once
  • Throttle them, and retry the process
  • Easy and cheap to restart
  • Overall INEXPENSIVE
An error during Transform
  • Easy to reproduce via unit tests/fixtures
  • Rebuild, redeploy, re-run
  • Overall MEDIUM EXPENSE
An error during Load
  • Most likely in "somebody else's code"
  • Investigation/rectification may require cross-functional/cross-team/cross-company communications
  • Re-run for this scenario may be blocked if target system needs to be cleared down
  • Overall HIGH EXPENSE

Thus it behooves us (such a great vintage phrase, that!) to get our transforms nice and tight; heavy unit-testing is an obvious solution here, but so is careful consideration of how questionable input data should be handled. In our case, our initial transform attempts took the pragmatic, Postel-esque "accept garbage, don't throw an error, return something sensible" approach. So upon encountering invalid data, for example, we'd log a warning and transform it into an undefined object or empty array as appropriate.
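
Here's a minimal sketch of that permissive style; the names and shapes are hypothetical, not our real schema:

// Permissive mode (hypothetical): bad input is logged and flattened into an
// "empty" value rather than failing the ETL job.
interface Address {
  street?: string;
  postcode?: string;
}

function transformAddresses(raw: unknown): Address[] {
  if (!Array.isArray(raw)) {
    console.warn(`Expected an array of addresses, got ${typeof raw}`);
    return []; // swallow the problem and carry on
  }
  // Silently drop any entries that aren't objects
  return raw.filter((a): a is Address => typeof a === "object" && a !== null);
}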

This turned out to be a problem, as we weren't getting enough feedback about the sheer amount of bad input data we were simply skimming over, resulting in gaps in the data being loaded into the new system.

So in the next phase of development, we became willingly, brutally "fragile", throwing an error as soon as we encountered input data that wasn't ideal. This would obviously result in a lot of failed ETL jobs, but it alerted us to the problems which we could then mitigate in the source system or with code fixes (and unit tests) as needed.
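
A sketch of the same hypothetical transform in "fragile" mode (reusing the Address shape from the earlier sketch), where questionable input fails the job loudly instead of being skimmed over:

// Fragile mode (hypothetical): any questionable input throws, so nothing gets
// silently skipped and every data problem surfaces as a failed ETL job.
function transformAddressesStrict(raw: unknown): Address[] {
  if (!Array.isArray(raw)) {
    throw new Error(`Expected an array of addresses, got ${typeof raw}`);
  }
  return raw.map((a, i) => {
    if (typeof a !== "object" || a === null) {
      throw new Error(`Address at index ${i} is not an object: ${JSON.stringify(a)}`);
    }
    return a as Address;
  });
}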

Interestingly, it turned out that in the "long tail" of the customer migration project, we had to return (somewhat) to the "permissive" mode in order to get particularly difficult customer accounts migrated. The approach at that point was to migrate them with known holes in their data, and fix them up in the TARGET system.

Here's my crude visualisation of it. I don't know if this mode of code evolution has a name but I found it interesting.

Sunday, 16 April 2023

Micro-Optimisation #393: More Log Macros!

I've posted some of my VSCode Log Macros previously, but wherever there is repetitive typing, there are further efficiencies to be gleaned!

Log, Label and Prettify a variable - [ Ctrl + Option + Command + J ]

You know what's better than having the contents of your console.log() autogenerated?

Having the whole thing inserted for you!

How do I add this?

On the Mac you can use ⌘K ⌘S to see the pretty shortcut list, then hit the "Open Keyboard Shortcuts (JSON)" icon in the top-right to get the text editor to show the contents of keybindings.json. And by the way, execute the command Developer: Toggle Keyboard Shortcuts Troubleshooting to get diagnostic output on what various special keystrokes map to in VSCode-speak (e.g. on a Mac, what Ctrl, Option and Command actually do).

keybindings.json
// Place your key bindings in this file to override the defaults
[
  {
    "key": "ctrl+meta+alt+j",
    "when": "editorTextFocus",
    "command": "runCommands",
    "args": {
      "commands": [
        {
          "command": "editor.action.copyLinesDownAction"
        },
        {
          "command": "editor.action.insertSnippet",
          "args": {
            "snippet": "\nconsole.log(`${TM_SELECTED_TEXT}: ${JSON.stringify(${TM_SELECTED_TEXT}$1, null, 2)}`);\n"
          }
        },
        {
          "command": "cursorUp"
        },
        {
          "command": "editor.action.deleteLines"
        },
        {
          "command": "cursorDown"
        },
        {
          "command": "editor.action.deleteLines"
        }
      ]
    }
  }
]

This one uses the new (for April 2023, VSCode v1.77.3) runCommands command, which, as you might infer, allows commands to be chained together in a keybinding. A really nice property of this is that you can Command-Z your way back out of the individual commands; very helpful for debugging the keybinding, but also potentially just nice-to-have.

The trick here is to retain the text selection so that ${TM_SELECTED_TEXT} continues to contain the right thing, without clobbering whatever might be in the editor clipboard at that moment. We do this by copying the line down, which helpfully keeps the selection right on the variable where we want it. We then blast over the top of the selection with the logging line, but by sneakily inserting \n symbols at each end, we break the old line up into 3 lines, of which the middle one is the only one we want to keep. So we delete the lines above and below it.
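
To make it concrete: for a hypothetical variable called order, selecting its name and hitting the shortcut leaves the original line alone and drops in a labelled, prettified log line like this:

const order = { id: 42, items: ["tape", "adaptor"] }; // hypothetical variable

// Generated by the macro with "order" selected:
console.log(`order: ${JSON.stringify(order, null, 2)}`);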

Saturday, 25 March 2023

In praise of ETL, part one; E's are good

I've written previously about how at work I've been using an ETL (Extract, Transform, Load) process for customer migrations, but that was mostly in the context of a use case for AWS Step Functions.

Now I want to talk about ETL itself, and how good it's been as an approach. It's been around for a while so one would expect it to have merits, but I've found some aspects to be particularly neat and wanted to call them out specifically. So here we go.

An Extract is a perfect test fixture

I'd never realised this before, but the very act of storing the data you plan on Transforming and Loading is tremendously powerful. Firstly, it lets you see exactly what data your Transform was acting upon; secondly, it gives you replay-ability using that exact data (if that's what you want/need); and thirdly, you've got an instant source of test fixture data for checking how your transform code handles that one weird bug that you just came across in production.

My workflow for fixing transform-stage bugs literally became the following (there's a sketch of such a fixture-driven test after the list):

  • Locate JSON extract file for the process that failed
  • Save as local JSON file in test fixtures directory of the transform code
  • Write a test to attempt to transform this fixture (or sub-component of it)
  • Test should fail as the production code does
  • Fix transform code, test should now pass
  • Commit fixed code, new test(s) and fixture
  • Release to production
  • Re-run ETL process; bug is gone
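
As a sketch of that kind of test (Jest assumed; the file names, fixture and transformAccount function are hypothetical, and importing JSON like this needs resolveJsonModule enabled in tsconfig):

// transform.test.ts - reproduce a production failure using the real extract
import failedExtract from "./fixtures/account-1234-extract.json";
import { transformAccount } from "./transform";

describe("transformAccount", () => {
  it("handles the extract that blew up in production", () => {
    // Before the fix, this threw / produced bad data, exactly as in production
    const result = transformAccount(failedExtract);

    // Assert on whatever the bug was about, e.g. addresses no longer dropped
    expect(result.addresses).toHaveLength(2);
  });
});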

Monday, 27 February 2023

Stepping up, and back, with the new Next.js "app" directory

I'm toying around with a new web-based side project and I thought it was time to give the latest Next.js version a spin. Although I've used Create-React-App (generally hosted on Netlify) more recently, I've dabbled with Next.js in one capacity or another since 2018, and this time some server-side requirements made it a better choice.

The killer feature of the 2023 "beta version" of Next.js (which I assume will eventually be named Next.js 14) is the app directory, which takes Next's already-excellent filesystem-based routing (i.e. if you create a file called bar.tsx in a directory called foo, you'll find it served up at /foo/bar without writing a line of code) and amps it up. A lot.

I won't try to reiterate their excellent documentation, but their nested layouts feature is looking like an absolute winner from where I'm sitting, and I'd like to explain why by taking you back in time. I've done this before when talking about React-related stuff, joking that the HTML <img> tag was like a proto-React component. And I still stand by that comparison; I think this instant familiarity is one of the fundamental reasons why React has "won" the webapp developer mindshare battle.

Let me take you back to 1998. The Web is pretty raw, pretty wild, and mostly static pages. My Dad's website is absolutely no exception. I've meticulously hand-coded it in vi as a series of stand-alone HTML pages which get FTP'ed into position on his ISP's web server. Although I'm dimly aware of CSS, it's mainly still used for small hacks like removing the underlines from links (I still remember being shown the way to do this with an inline style tag and thinking it would never take off) - and I'm certainly not writing a separate .css file to be included by every HTML file. As a result, everything is styled "inline" so-to-speak, but not even in the CSS way; just mountains of widths and heights and font faces all over the place. It sucked, but HTML was getting better all the time so we just put up with it, used all that it offered, and automated what we could. Which was exactly what I did. If you dare to inspect the source of the above Wayback Machine page, you'll see that it uses HTML frames (ugh), which was a primitive way of maintaining a certain amount of UI consistency while navigating around the site.

The other thing I did to improve UI consistency was a primitive form of templating. Probably more akin to concatenation, but I definitely had a header.htm which was crammed together with (for example) order-body.htm to end up with order.htm, using a DOS batch file that I ran to "pre-process" everything prior to doing an FTP upload - a monthly occurrence as my Dad liked to keep his "new arrivals" page genuinely fresh. Now header.htm definitely wasn't valid HTML as it would have had unclosed tags galore, but it was re-used for several pages that needed to look the same, and that made me feel efficient.

And this brings me to Next.js and the nesting layouts functionality I mentioned before. To achieve what took me a pile of HTML frames, some malformed HTML documents and a hacky batch file, all I have to do is add a layout.tsx and put all the pages that should use that UI alongside it. I can add a layout.tsx in any subdirectory and it will apply from there "down". Consistency via convention over configuration, while still nodding to the hierarchical filesystem structures we've been using since Before The Web. It's just really well thought-out, and a telling example of how much thought is going into Next.js right now. I am on board, and will be digging deeper this year for sure.
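
For a flavour of it, here's a minimal sketch of one of these layouts (the app/shop path and the markup are hypothetical):

// app/shop/layout.tsx - every page under app/shop/ renders inside this,
// and layouts nest from the app/ root downwards.
import type { ReactNode } from "react";

export default function ShopLayout({ children }: { children: ReactNode }) {
  return (
    <section>
      <nav>{/* navigation shared by everything under /shop */}</nav>
      {children}
    </section>
  );
}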

Sunday, 29 January 2023

Sneaking through the Analog Hole

I perhaps-foolishly recently agreed to perform a media-archiving task. A series of books-on-tape (yes, on physical audio cassettes), almost unplayable at this point in the century, needed to be moved onto a playable medium. For this particular client, that meant onto Audio CDs (OK so we're moving forward, but not too far!). I myself didn't have a suitable playback device, but quickly located a bargain-priced solution, second-hand on eBay (of course) - an AWA E-F34U that appears to be exclusively distributed by the Big W retail chain here in Australia:

This device purports to be a one-USB-cable solution to digitising the contents of analogue cassettes. Unfortunately, the example I just purchased had extremely severe issues with its USB implementation. The audio coming straight off the USB cable would jump from perfectly fine for a few seconds, to glitchy, stuttering and repeating short sections, to half-speed slooooow with the attendant drop in pitch. Unusable.

I only hope that the problem is isolated to my unit (which was cheap and described as "sold untested", so I have no-one to blame but myself) - if not, someone's done a really bad job of their USB Audio implementation. Luckily, the USB power works absolutely fine, so the unit could still serve as a tape transport while I resorted to the old "Analog Hole" solution via my existing (rather nice) USB Audio Interface, a Native Instruments Komplete Audio 1, which I picked up after my previous interface, a TASCAM FireOne, finally kicked the bucket.

In the following picture, you can see my digitising solution. AWA tape transport (powered by USB) to 3.5mm headphone socket, through a 1/4" adaptor to a short guitar lead and into the Komplete Audio 1's Line In. From there, it goes in via the KA1's (fully-working!) USB connection to GarageBand on the Mac. A noise gate and a little compression are applied, and once each side of each tape has been captured, it gets exported directly to an MP3 file. I intend to present the client with not only the Audio CDs but also a data CD containing these MP3s so that future media formats can hopefully be more easily accommodated.

What if I didn't already have a USB audio interface? Would the client have given up, with their media stuck in the analog era, never to be heard again?

It amused me that analog technology was both the cause of this work - in that this medium and the ability to play it has gone from ubiquitous in the 1980s to virtually extinct - and its solution, using an analog interface to get around a deficient digital one.

Sunday, 18 December 2022

Three D's of 3D Printing in 2022

I've been somewhat fascinated with 3D printing ever since becoming aware of it a decade ago, but it was prohibitively expensive to get into when machines were in the four-digit USD$ range and seemed likely to be limited to somewhat-unreliably producing useless tchotchkes at a vastly higher cost than just buying them. Things have changed. A lot.

Declining costs

Cost of entry
My new(ish) printer is the Cocoon Create Modelmaker, a respin/reskin of the Wanhao i3 Mini - which if you follow the links, you'll note is USD$199 brand new, but I got mine second-hand on eBay for AUD$100. I'm a sucker for an eBay bargain. When I picked it up, the seller (who was upgrading to a model with a larger print bed) also gave me a crash course in printing and then threw in an almost-full 300m spool of filament to get me started - another AUD$20 saved. So I'm already at the f*@k it point as far as up-front investment goes.

Cost of materials
I'm picking up 300m rolls of PLA filament from eBay for AUD$20-$24 delivered, and I'm choosing local suppliers so they typically get delivered within 3 days. I could go even cheaper if I used Chinese suppliers. The biggest thing I've printed so far was a case for my Raspberry Pi 3B+, part of a 19" rack mount setup (I'm also a sucker for anything rackmounted) - that took 21 hours and used about 95c of filament. So really, it's starting to approach "free to make" as long as you don't place too much value on your own time...

Damn Fine Software

Seven years ago, Scott Hanselman documented his early experiences with 3D printing; there was a lot of rage and frustration. Maybe I've just been lucky, maybe buying a printer that had already been used, tweaked, and enhanced (with 3d-printed upgrade parts) was a galaxy-brain genius move, but honestly, I've had very little trouble, and I'd estimate less than 50c of material has ended up in the bin. Happy with that. I think the tools have moved on supremely in that time, and awesomely, they're all FREE and most are also Open-Source.

Cura
Ultimaker Cura takes an STL file and "slices" it into something your actual hardware can print, via a G-Code file. It's analogous to the JVM taking generic Java bytecodes and translating them to x86 machine language or whatever. Anyway, it does a great job, and it's free.

OctoPrint
My first 3D printing "toolchain" consisted of me slicing in Cura on my Mac followed by saving the file to a micro SD card (via an adapter), then turning around to place the (unsheathed) micro SD card into my printer's front-panel slot, and instructing it to print. This was fine, but the "sneakernet"-like experience was annoying (I kept losing the SD adapter) and the printer made a huge racket being on a table in the middle of the room. Then I discovered OctoPrint, an open-source masterpiece that network-enables any compatible 3D printer with a USB port. I pressed my otherwise-idle 5-series Intel NUC into service and it's been flawless, allowing me to wirelessly submit jobs to the printer, which now resides in a cupboard, reducing noise and increasing temperature stability (which is good for print quality)

Tinkercad
It didn't take long for me to want a little more than what Thingiverse et al could provide. Thingiverse's "Remix" culture is just awesome - a hardware equivalent to open-sourced software - but my experience of CAD was limited to a semester in university bashing up against Autodesk's AutoCAD, so I figured it would just be too hard for a hobbyist like me to create new things. Then I discovered Tinkercad, a free web application by, of all companies, Autodesk! This app features one of the best tutorial introductions I've ever seen; truly, I could get my 10-year old daughter productive in this software thanks to that tutorial. And the whole thing being on the web makes it portable and flexible. Massive kudos to Autodesk for this one.

Do It

The useless tchotchke era is over; I've been using my printer to replace lost board game tokens, organise cables, rackmount loose devices, and create LEGO parts that don't exist yet. As far as I'm concerned it's virtually paid for itself already, and I'm still getting better as a designer and operator of the machine. If you've been waiting for the right time to pounce, I strongly recommend picking up a used 3D printer and giving it a whirl.

Saturday, 26 November 2022

AWS Step Functions - a pretty-good v1.0

I've been using Amazon's Step Functions functionality a fair bit at work recently, as a way to orchestrate and visualise a migration process that involves some Extract-Transform-Load steps and various other bits, each one being an AWS Lambda.

On the whole, it's been pretty good - it's fun to watch the process chug along with the flowchart-like UI automagically updating (I can't show you any screenshots unfortunately, but it's neat). There have been a couple of reminders however that this is a version 1.0 product, namely:

Why can't I resume where I failed before?

With our ETL process, frequently we'll detect a source data problem in the Extract or Transform stages. It would be nice if after fixing the data in place, we could go back to the failed execution and just ask it to resume at the failed step, with all of the other "state" from the execution intact.

Similarly, if we find a bug in our Extract or Transform lambdas themselves, it's super-nice to be able to monkey-patch them right there and then (remembering of course to update the source code in Git as well) - but it's only half as nice as it could be. If we could fix the broken lambda code and then re-run the execution that uncovered the bug, the cycle time would be outstanding.

Why can't you remember things for me?

Possibly related to the first point is the disappointing discovery that Step Functions have no "memory" (or "context", if you prefer) where you can stash a variable for use later in the pipeline. That is, you might expect to be able to declare 3 steps like this:

    Extract Lambda
      Inputs:
        accountId
      Outputs:
        pathToExtractedDataBucket

    Transform Lambda
      Inputs:
        pathToExtractedDataBucket
      Outputs:
        pathToTransformedDataBucket

    Load Lambda
      Inputs:
        accountId
        pathToTransformedDataBucket
      Outputs:
        isSuccessful
  
But unfortunately that simply will not work (at time of writing, November 2022). The above pipeline will fail at runtime because accountId has not been passed through the Transform lambda in order for the Load lambda to receive it!

For me, this really makes a bit of a mockery of the reusability and composability of lambdas with step functions. To fix the situation above, we have to make the Extract Lambda emit the accountId, and make the Transform Lambda aware of accountId and pass it through, even though it has no interest in, or need for, it! That is:

    Extract Lambda
      Inputs:
        accountId
      Outputs:
        accountId
        pathToExtractedDataBucket

    Transform Lambda
      Inputs:
        accountId
        pathToExtractedDataBucket
      Outputs:
        accountId
        pathToTransformedDataBucket

    Load Lambda
      Inputs:
        accountId
        pathToTransformedDataBucket
      Outputs:
        isSuccessful
  
That's really not good in my opinion, and makes for a lot of unwanted cluttering-up of otherwise reusable lambdas, dealing with arguments that they don't care about, just because some other entity needs them. Fingers crossed this will be rectified soon, as I'm sure I'm not the first person to have been very aggravated by this design.
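
For concreteness, here's a sketch of what that pass-through clutter looks like inside the Transform Lambda's handler (TypeScript; the event/result shapes and doTransform are hypothetical, not our actual code):

// The Transform Lambda has to accept and echo accountId purely so that the
// Load Lambda, further down the state machine, can receive it.
interface TransformEvent {
  accountId: string;
  pathToExtractedDataBucket: string;
}

interface TransformResult {
  accountId: string; // passed through untouched; Transform never uses it
  pathToTransformedDataBucket: string;
}

declare function doTransform(extractPath: string): Promise<string>; // hypothetical

export const handler = async (event: TransformEvent): Promise<TransformResult> => {
  const pathToTransformedDataBucket = await doTransform(event.pathToExtractedDataBucket);
  return {
    accountId: event.accountId, // the unwanted clutter
    pathToTransformedDataBucket,
  };
};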