Tuesday, 12 October 2010

*Facepalm* Moments, Part One

Self-defeating updates

My gig at the satellite-vehicle-tracking company was an interesting experience. A great deal of stuff was seat-of-the-pants, looks-like-it-works-ship-it kinda software because the CEO had put his house on the line and every week we didn't have a solution put him a week closer to being homeless.

Corners were cut. Unit and integration testing was non-existent. As soon as a (flaky as hell) Version 1.0 was in the wild, we operated in what The Daily WTF calls a Developmestruction environment. Our production server was a Windows Server 2003 box that was infected with so many trojans that Internet Explorer took minutes to start up - but we couldn't afford to clean it and reboot in case it never came back.

Good times, good times ;-)

Anyway, the subject of this classic *facepalm* moment from my career is actually the embedded code that ran in the little boxes that got installed underneath car dashboards. It was pretty standard embedded C code - malloc() was banned, circular buffers ruled and everything ran off a gigantic while(1) loop in main.

The firmware actually worked pretty well. I'd managed to iron out all the tiny leaks that caused it to die when left to run overnight, and it dealt nicely with all of the weird and wonderful failure modes the attached GPRS modem could throw at it. (We had a simple serial link to the modem and treated it just like a standard dialup device, firing AT commands to it, setting up a PPP link and resetting it whenever the dreaded NO CARRIER message arrived - which was a lot, GPRS was pretty sketchy in Australia way back in '03). We'd gone through a couple of major version increments, adding some good features and desperately trying to tidy up the code as we went along, when we got to The Big One. Over-The-Air Firmware Upgrades. I had privately always thought that this feature should have been in the code since version one, but apparently there were always more lucrative things to add. I digress.

The process was pretty simple. We'd point the server-side code at a binary file and it would slice it up into packets using our proprietary protocol that ran on top of UDP, firing them one-by-one to the target device. Upon reception of a packet, the device would run a rudimentary checksum over it and ACK or NAK the segment - a NAK triggering a resend. If the segment had arrived intact, it would be copied out to flash, as there wasn't enough RAM to hold the entire firmware image. Once this extremely tedious process was complete (we'd typically push through about one packet per second, and the download used 200+ packets, with probably one-in-ten needing a resend), the box would run another checksum over the entire firmware image before firing off a special processor-specific machine code incantation that caused it to reflash itself and reboot.

So I'm attempting the first-ever over-the-air download with the aforementioned CEO (who designed the process on the back of a napkin) and things are going pretty well until we get to packet 189 of 220, at which point the device seems to lose all connection with the outside world and reboot its GPRS modem. "Oh well," we say "these things happen, let's kick it off again". Four-plus minutes later, the download dies again at packet 189. We go over the obvious possible causes in the code. There's nothing special about the number that would cause an overflow. There's plenty of flash memory left over. All of the circular buffer pointers look sensible. But because the flash-writing process is somewhat finicky, we're loath to attach a debugger and possibly introduce more weirdness. We're working almost entirely in the dark.

On a whim, I decide to fire up Ethereal (as it was then, now Wireshark) and inspect the packets my development box was sending to the device. And sure enough, we find the culprit lurking in packet 189:

    63 64 41 54 45 30 41 54 48 30 45 51 56 31 45 54    cdATE0ATH0ATV1AT 

    4D 30 4E 4F 20 43 41 52 52 49 45 52 41 54 44 54    M0NO CARRIERATDT

Yes, we'd got to the symbol table in the firmware binary, and the device code that was constantly scanning the receive buffer for "NO CARRIER" found exactly what it was looking for, promptly resetting the GPRS modem.


No comments:

Post a Comment