Python

The biggest and last website to move over to new hardware was THOK.ORG itself. Bits of this website go back decades, to a slightly overclocked 486DX/25 on a DSL line - while static websites have some significant modern advantages, the classic roots are in "not actually having real hardware to run one". That said, it does have a lot of sentimental value and holds a lot of personal memory - mainly project notes, for things like "palm pilot apps" or "what even is this new blogging thing" - so I do care about keeping it running, but at the same time was a little nervous about touching it.

(Spoiler warning: as of this posting, the conversion is complete and mostly uneventful, and I've made updates to the new site - this is just notes on some of the conversion process.)

Why is a static site complicated?

"static site" can mean a lot of things, but the basic one is that the web server itself only delivers files over http/https and doesn't do anything dynamic to actually deliver the content.1 This has security benefits (you don't have privilege boundaries if there are no privileges) and run-time complexity benefits (for one example, you're only using the most well-tested paths through the server code) but it also has testing and reliability benefits - if you haven't changed anything in the content, you can reasonably expect that the server isn't going to do anything different with it, so if it worked before, it works now.

This also means that you will likely have a "build" step where you take the easiest-to-edit form and turn it into deliverable HTML. Great for testing - you can render locally, browse locally, and then push the result to the live site - but it does mean that you want some kind of local tooling, even if it's just the equivalent of find | xargs pandoc and a stylesheet.

For THOK.ORG, I cared very little about style and primarily wanted to put up words (and code snippets) - Markdown was the obvious choice, but it hadn't been invented yet! I was already in the habit of writing up project notes using a hotkey that dropped a username and datestamp marker in a file, and then various "rich text" conventions from 1990s email (nothing more than italic, bold, and code) - I wasn't even thinking of them as markup, just as conventions that people recognized in email without further rendering. So while the earliest versions of the site were just HTML, later ones were a little code to take "project log" files and expand them into blog-like entries. All very local, README → README.html and that was it.

Eventually I wrote a converter that turned the project logs into "proper" markdown - not a perfect one (while using a renderer helped bring my conventions in line with what rendered ok, I never managed to really formalize it and some stuff was just poorly rendered), just one that was good enough that I could clean up the markdown by hand and go all in on it. There was a "side trip" of using Tumblr as a convenient mobile blogging service - phone browsers were just good enough that I could write articles in markdown on a phone with a folding bluetooth keyboard at the pycon.ca conference (2012) and get stuff online directly - I didn't actually stick with this and eventually converted them back to local markdown blogs (and then still didn't update them.)

Finally (2014 or so) I came up with a common unifying tool to drag bits of content together and do all of the processing for the content I'd produced over the years. thoksync included a dependency declaration system that allowed parallelized processing, and various performance hacks that have been overtaken by Moore's Law in the last decade. The main thing is that it was fast enough to run in a git post-update hook, so when I pushed changes to markdown files they'd get directly turned into live site updates. Since I was focused on other things in the meantime (including a new startup in 2015) and the code worked, I hadn't really touched it in the last decade... so it was still python 2 code.

Python 2 to Python 3 conversion

Having done a big chunk of work (including a lot of review, guidance, and debugging) on a python 3 conversion of a commercial code base, I was both familiar with the process and had not expected to ever need to touch it again - the product conversion itself was far later than was in any way reasonable, and most other companies would have been forced to convert sooner. It was a bit of a surprise to discover another 2000+ lines of python 2 code that was My Problem!

While there were only a few small CLI-tool tests in the code (which I was nonetheless glad to have) I did have the advantage of a "perfect" test suite - the entire thok.org site. All I had to do was make sure that the rendering from the python 3 code matched the output from the python 2 code - 80,000 lines of HTML that should be the same should be easy to review, right?

This theory worked out reasonably well at first - any time the partially converted code crashed, well, that was obviously something that needed fixing.

Here in 2025, with Python 3.14 released and the Python Documentary published, no one really cares about the conversion process as anything but a historical curiosity... but I had a bunch of notes about this particular project so I might as well collect them in one place.

  • Trivia
    • #! update (I prefer /usr/bin/python3 but there are solid arguments that /usr/bin/env python3 is better; I just don't happen to use venv or virtualenv, so for my workflow they're equivalent.)
    • print → print(), >>file → file= - print itself was one of the original big obnoxious changes that broke Python 2 code instantly; it wasn't until relatively late that from __future__ import print_function came along, which didn't help existing code but gave you a chance to upgrade partially and have shared code that was still importable from both versions. (Sure, library code shouldn't call print - it still did, so it was still a source of friction. Personally I would have preferred a mechanism for paren-less function calls or definitions... but I wanted that when I first started using Python 2, and it was pretty clear that it wasn't going to happen. M-expressions didn't catch on either...) A before/after sketch of a few of these changes follows this list.
    • Popen(text=True) was a fairly late way of saying "the python 2 behaviour was fine for most things, let's have that back instead of littering every read and write with conversion code." (universal_newlines=True did the same thing earlier, kind of accidentally.)
    • file() → open() wasn't particularly important.
    • long → int (only in tumblr2thoksync; most of this code was string handling, not numeric) - this was just dropping an alias for consistency, they'd long been identical even in Python 2.
    • import rfc822 → import email.utils (parsedate and formatdate were used in a few RSS-related places. Just (reasonable) reorganization, the functions were unchanged.)
    • SimpleHTTPServer, BaseHTTPServer → http.server
    • isinstance(basestring) → isinstance(str) - string/byte/unicode handling was probably the largest single point where reasoning about large chunks of code from a 2-and-3 perspective was necessary; it's also somewhere that having type hints in python 2 would have been an enormous help, but the syntax didn't exist. Fortunately, for this project none of the subtleties applied - most of the checks were really that something was not an xml.etree fragment; it didn't matter at all what kind of string it was.
  • Language improvements
    • except as - nicer to stuff an exception that you're intentionally poking at into a relevantly-named variable instead of rummaging around in sys.exc_info. (raise from is also great but nothing in this codebase needed it.)
    • f=open() → with open() as f encourages paying attention to file handle lifetimes, reducing the risk of handle leakage and avoiding certain classes of bugs caused by files not flushing when you expect them to ("when the handle gets garbage collected" vs. the much more explicit and visible "when you leave the scope of the with clause".)
    • argument "tuple unpacking" is gone - this wasn't an improvement so much as "other function syntax made it harder to get this right and it wasn't used much, and there was a replacement syntax (explicit unpacking) so it was droppable." Not great but maybe it was excessively clever to begin with.
    • Python 2 allowed sorting functions by id; Python 3 doesn't, so just extract the names in key= (the actual order never mattered, just that the sort was consistent within a run.)
  • Third-party library changes (after all, if your callers need massive changes anyway, might as well clean up some of your own technical debt, since you can get away with incompatible changes.)
    • Markdown library
      • markdown.inlinepatterns.Pattern → InlineProcessor (the old API still exists, but some porting difficulties meant that the naïve port wasn't going to work anyway, so it made sense to debug the longer-lived new API.)
      • etree no longer leaked from markdown.util (trivial)
      • grouping no longer mangled, so .group(1) is correct and what I'd wanted to use in the first place
      • add → register (trivial)
      • different return interface
      • string hack for WikiLinkExtension arguments no longer works; the class-based interface was well documented and had better language level sanity checking anyway.
    • lost the feedvalidator package entirely, so minivalidate.py doesn't actually work yet (probably not worth fixing; external RSS validators are better cared for and independent anyway.)
    • lxml.xml.tostring → encoding="unicode" in a few places to json-serialize sanely
      • in a few places, keep it bytes but open("w" → "wb") instead
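
To make a few of those concrete, here's a minimal before/after sketch - illustrative only, not the actual thoksync code:

    import subprocess

    # print is a function, and "print >>f" became the file= keyword;
    # the with-statement makes the handle's lifetime explicit:
    with open("build.log", "a") as log:
        print("rebuilt", "index.html", file=log)

    # text=True: str in, str out, no manual encode/decode:
    listing = subprocess.run(["ls"], capture_output=True, text=True).stdout

    # basestring is gone; a plain str check covers it in Python 3:
    def is_stringish(fragment):
        return isinstance(fragment, str)

    # functions no longer sort (Python 2 fell back to comparing by id);
    # extract the names as a key instead:
    def render_html(): pass
    def render_rss(): pass
    renderers = sorted([render_rss, render_html], key=lambda f: f.__name__)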

Once the tooling got to the point where it ran on the entire input without crashing, the next "the pre-existing code is by definition correct" test was to just diff the built site (the output) with the existing Python 2 version. The generated HTML code converged quickly, but it did turn up some corrupted jpg files and other large binaries; these were all repairable from other sources, but it does suggest that more long-term content verification (or at the very least, "checking more things into git") should be an ongoing task. (All of the damage was recoverable, it was just distressing that it went undiscovered as long as it did.)

Attempting to get out of the blog tooling business

The tooling described here evolved around a particular kind of legacy data and ideas, and isn't really shaped appropriately for anyone else to use - it isn't even well-shaped for me to use on any other sites. While the port did allow me to do some long-overdue content maintenance of thok.org itself, it was getting in the way of a number of other web-writing projects. Attempting to apply the Codes Well With Others principle, I dug into using staticsite, which was simple, written in Python 3, based on markdown and Jinja2 and had at least some recent development work. I ended up using it for several sites including this one, though not thok.org itself (at this time.)

I may end up going back and doing a replacement for staticsite - though it's really worked pretty well - but I expect to keep it shaped like staticsite so I can use it as a drop-in replacement for the current handful of sites. (I will probably try to start with just a replacement template - using plain HTML rather than upgrading to a current version of React - since most of what I want is very simple.) The other possibility is to move to pandoc as the engine, because it tries hard in entirely different ways.

Things Left Behind

The old system had a notification mechanism called Nagaina, with a plugin system for "probes" (AFS, Kerberos, NTP, disks, etc.) and a crude model: run all current probes, then diff against the previous run and notify (via Zephyr) if anything changed. The biggest flaw of this approach was that it relied on sending messages via MIT's Zephyr infrastructure; the second biggest was that it actually worked, so I didn't feel that compelled to improve it (or move to something else.)

The new system has a bunch of systemd timer jobs that do things and reports on them by email; OpenAFS notification is gone because the cell is gone, and other things have simpler failure modes and just need less monitoring. I have an extensive folder of possible replacement notification mechanisms - some day I'll pick one and then work backwards to tying anomaly detection and alerting into it.


  1. This definition of static doesn't preclude things with client-side javascript - I've seen one form of static site where the server delivered markdown files directly to the client and the javascript rendered them there, which is almost clever but requires some visible mess in the files, so I've never been that tempted; it would also mean implementing my own markdown extensions in javascript instead of python, and... no. 

KPhotoAlbum has lots of built-in features, but in practice the more convenient1 way to interface with it from simple unix tools is to just operate on the index.xml where all of the metadata is stored.2

kpa-grep

I started kpa-grep back in 2011, around when I hit 90k pictures (I'm over 200k now.) The originally documented use case was kpa-grep --since "last week" --tags "office" which was probably for sorting work pictures out from personal ones. (The fuzzy timedateparser use was there from day one; since then, I'm not sure I've used anything other than "last week" or "last month", especially since I never implemented date ranges.) I've worked on it in bursts; usually there's feedback between trying to do something with a sub-gallery, trying to script it, and then enhancing kpa-grep to handle it. The most recent burst added two features, primarily inspired by the tooling around my Ice Cream Blog -

  • A sqlite-based cache of the XML file (sketched after this list). Back in the day it took 6-8 seconds to parse the file; on a modern laptop with SSD and All The RAM it's more like 2.5 seconds. The sqlite processing takes a little longer than that, but subsequent queries are near-instant, which makes it sensible to loop over kpa-grep output and do more kpa-grep processing on it. A typical "pictures are ready, create a dummy review post for the last ice cream shop with all pictures and some metadata" operation was over a minute without the cache, and is now typically 5-10 seconds even with a stale cache.
  • Better tag support - mostly fleshing out unimplemented combinations of options, but in particular allowing --tag and --since to filter --dump-tags, which let me pick out the most recent Locations which are tagged ice cream, filter out city names, and have a short list of ice cream shops to work with. (Coming soon: adding some explicit checks of them against which shops I've actually reviewed already.)
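
The staleness check is the whole trick: rebuild the sqlite mirror only when index.xml is newer than the cache. A minimal sketch of the pattern - the paths, table layout, and element names here are simplified guesses, not kpa-grep's actual schema:

    import os
    import sqlite3
    from lxml import etree

    INDEX, CACHE = "index.xml", "index-cache.sqlite"  # hypothetical paths

    def open_cache():
        stale = (not os.path.exists(CACHE)
                 or os.path.getmtime(CACHE) < os.path.getmtime(INDEX))
        db = sqlite3.connect(CACHE)
        if stale:
            # pay the ~2.5s parse once; queries after that are near-instant
            db.execute("DROP TABLE IF EXISTS images")
            db.execute("CREATE TABLE images (file TEXT, startDate TEXT, tags TEXT)")
            for image in etree.parse(INDEX).getroot().iter("image"):
                tags = ",".join(v.get("value") for v in image.iter("value"))
                db.execute("INSERT INTO images VALUES (?, ?, ?)",
                           (image.get("file"), image.get("startDate"), tags))
            db.commit()
        return db

    # e.g.: open_cache().execute(
    #           "SELECT file FROM images WHERE tags LIKE ?", ("%office%",))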

As far as I know I don't have any users, but nonetheless it is on github, so I've put some effort into keeping it clean3; recently that's also included coming up with a low-effort workflow for doing releases and release artifacts. This is currently a shell script involving debspawn build, dpkg-parsechangelog, and gh release upload which feels like an acceptable amount of effort for a single program with a man page.

pojkar

pojkar is a collection of Flickr upload tools that work off of KPhotoAlbum.4 The currently active tools are sync-to-flickr and auto-cropr.

sync-to-flickr

sync-to-flickr is the engine behind a simple workflow: when I'm reviewing photos in KPhotoAlbum, I choose particular images for posting by adding the Keyword tag flickr to the image. Once I've completed a set and quit out of KPhotoAlbum, I run sync-to-flickr sync which looks for everything tagged flickr, uploads it to Flickr with a title, description, and rotation (and possibly map coordinates, except there are none of those in the current gallery.) There's also a retry mechanism (both flickr's network and mine have improved in the last decade so this rarely triggers.) Once a picture has been uploaded, a flickd tag is added to it, so future runs know to skip it.

After all of that, the app collects up the tags for the posted set of pictures; since social media posting5 has length limits (and since humans are reading the list) we favor longer names and names that appear more often in the set; then we drop tags that are substrings of other tags (dropping Concord in favor of Concord Conservation Land, since the latter implies the former well enough.) Finally we truncate the list to fit in a post. A sketch of this winnowing follows.
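
Roughly that heuristic, as a sketch (the function name and length limit are made up, not from sync-to-flickr):

    from collections import Counter

    def pick_tags(tags_per_picture, limit=300):
        counts = Counter(tag for tags in tags_per_picture for tag in tags)
        # favor tags that appear more often, then longer names
        ranked = sorted(counts, key=lambda t: (counts[t], len(t)), reverse=True)
        # drop tags that are substrings of other tags (Concord is implied
        # by Concord Conservation Land)
        kept = [t for t in ranked
                if not any(t != other and t in other for other in ranked)]
        # truncate to fit the post
        out = []
        for tag in kept:
            if sum(len(t) + 1 for t in out) + len(tag) > limit:
                break
            out.append(tag)
        return out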

auto-cropr

Flickr has an obscure6 feature where you could select a rectangle on a picture (in the web interface) and add a "note" to that region. auto-cropr used the API to look for recent notes containing a magic string - then picked up the geometry of the rectangle and cropped just that area, posting it as a new flickr picture - and then cross-linked the two, replacing the original note with a link to the new image. Basically this let you draw the viewer's attention to a particular area and then let them click to zoom in on it and get more commentary as well as a "closeup".

Note that these "views" are only on Flickr, I don't download or back them up at all (I should fix that.)

fix-kpa-missing/kpa-insert

As part of the Nokia 6630 image fixing project there ended up being a couple of different cleanups which I needed to review carefully, so I wanted the tools to produce diffable changes, which lxml doesn't really guarantee7. Currently, the XML written out by KPhotoAlbum is pretty structured - in particular, any image with no tags is a one-line <image ... /> - and I was particularly looking either to make corrections to things that were fundamentally untagged8/untaggable (for fix-kpa-missing), or to insert lines that were already one-line-per-picture; I just had to get them in the right place.

When I started the image recovery, I ended up just adding a bunch of images with their original datestamps (from 2005), but KPhotoAlbum just added them to the end of the index (since they were "new" and I don't use the sorting features.) So I had the correct lines for each image (with checksums and dimensions); I could just chop them out of the file. Then kpa-insert takes these lines and walks through the main index as well. For basically any line that doesn't begin with <image it just copies it through unchanged to the new index; when it finds an image line, it grabs the attributes (startDate, md5sum, and pathname specifically) and then checks them against the current head of the insertion list9. Basically, if the head of the index was newer than the head of the insertions, copy insertions over until that's no longer true. If they matched exactly, the original version just bailed so I could look at them; once I figured out that they really were duplicates, I changed it to output rm commands for the redundant files (and only kept the "more original" line from the original index.) A simplified sketch of the merge loop follows.
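
A simplified sketch of that loop - attribute handling trimmed to startDate, and the exact-duplicate/rm handling elided:

    import re

    def start_date(line):
        # the real code also compares md5sum and pathname
        return re.search(r'startDate="([^"]*)"', line).group(1)

    def merge(index_lines, insertions):
        pending = list(insertions)  # one-line <image .../> stanzas, date-sorted
        for line in index_lines:
            if line.lstrip().startswith("<image"):
                # copy insertions over until the index line is no longer newer
                while pending and start_date(pending[0]) <= start_date(line):
                    yield pending.pop(0)
            yield line
        # "direct but brittle": anything weird, just bail
        assert not pending, "insertion newer than everything in the index"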

The output was a diffable replacement index that I could review, and check that the "neighbor" <image> entries made sense, and that nothing was getting added from elsewhere in the tree, and other basic "eyeball" checks. Since I had to do this review anyway to make sure I hadn't made any mistakes of intent, it made sense to write the code in a "direct but brittle" style - anything weird, just bail with a good traceback; I wouldn't even look at the diffs until the code "didn't find anything weird." That also meant that I'd done the least amount of work10 necessary to get the right result - basically a degenerate case of Test Driven Development, where there's one input (my existing index) and one test (does the new index look right.)

I also didn't have any of my usual user interface concerns - no one (not even me) was ever going to run this code after making this one change. I did keep things relatively clean with small helper functions because I expected to mine it for snippets for later problems in the same space - which I did, almost immediately.

For fix-kpa-missing, I'd noticed some "dead space" in the main KPhotoAlbum thumbnail view, and figured that it was mostly the result of an old trailcam8 project. I was nervous about "losing" metadata that might point me at images I should instead be trying to recover, but here was a subset that I knew really were (improperly but correctly) discarded images - wouldn't it be nice to determine that they were the only missing images and clean it up once and for all?

So: the same "only look at <image lines" code from kpa-insert, extract the pathname from the attributes, and just check if the file exists; I could look for substrings of the pathname to determine that it was a trailcam pic and was "OK", plus I could continue with the "direct but brittle" approach and check that each stanza I was removing didn't have any tags/options - and just blow up if it found any (sketched after the list below). Since it found none, I knew that

  • I had definitely not (mis-)tagged any of the discarded pictures
  • I didn't have to write the options-handling code at all. (I suspect I will eventually need this, but the tools that are likely to need it will have other architectural differences, so it makes sense to hold off for now.)
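
The fix-kpa-missing pass, sketched in the same "direct but brittle" style (the trailcam substring and the regex are illustrative):

    import os
    import re

    def filter_index(lines):
        for line in lines:
            stripped = line.lstrip()
            if not stripped.startswith("<image"):
                yield line
                continue
            path = re.search(r'file="([^"]*)"', line).group(1)
            if os.path.exists(path):
                yield line
                continue
            # only drop bare one-line stanzas; tags/options mean human review
            assert stripped.endswith("/>"), f"missing image has options: {line!r}"
            assert "trailcam" in path, f"unexpectedly missing: {path}"
            # fall through without yielding: the stanza is dropped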

There were a couple of additional scripts cobbled up out of these bits:

  • fix-kpa-PAlbTN, which looked for Photo Album ThumbNails from the Nokia project and made sure they didn't exist anywhere else in the tree, since I was discarding the ones that I had real pictures for and wanted to be sure I'd really finished up all of the related work while I still had Psion 5 code in my head...
  • find-mbm, which used magic.from_file to identify all of the Psion Series 5 multi-bitmap image files (expensively, until the second or third pass, when I realized that I had all the evidence I needed that they only existed in _PAlbTN subdirectories, and could just edit the script to do a cheap path test first - effectively running file on a couple of hundred files instead of two hundred thousand; see the sketch after this list.) This was just to generate filenames for the conversion script; it didn't do any of the work directly.
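
A sketch of that cheap-test-first filtering, using the python3-magic binding (the paths are illustrative):

    from pathlib import Path
    import magic  # python3-magic

    for path in Path("/archive/photos").rglob("*"):
        if "_PAlbTN" not in path.parts:  # cheap path test first
            continue
        if not path.is_file():
            continue
        # only now pay for the libmagic content sniff
        if "Psion Series 5 multi-bitmap" in magic.from_file(str(path)):
            print(path)  # feed to the conversion script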

Conclusion

I now have three entirely different sets of tooling to handle index.xml that take very different approaches:

  • kpa-grep uses SQL queries on a sqlite cache of the entire index (read-only; it regenerates the cache by lxml-parsing the whole file if it's out of date)
  • pojkar does direct lxml parsing and rewriting (since it's used for uploads that used to be expensive, it does one parse up front and then operates on an internal tree, writing that out every time an upload succeeds, for consistency/checkpointing)
  • kpa-insert &c. treat the index.xml as a very structured text file - and operate efficiently but not very safely, relying on my reading the diffs to confirm that the ad-hoc tools worked correctly despite not being "proper".

Fortunately I've done all of the data-cleaning I intend to do for now, and the kpa-grep issue list is short and mostly releng, not features. I do eventually want a full suite of "manipulate images and tags" CLI tools, and I want them to be faster than 2.5s per operation11 - but I don't have a driving project that needs them yet - my photoblogging tools are already Fast Enough™.


  1. "Ergonomic" might be a better word than convenient, but I have a hard time saying that about XML. 

  2. This does require discipline about only using the tools when KPhotoAlbum itself isn't running, but that's not too big a deal for a personal database - and it's more about not doing updates in two places; it's "one program wins", not a file locking/corruption problem. 

  3. Most of the cleanliness is personal style, but lintian and pylint are part of that. This covers having a man page (using ronn to let me write them in Markdown) and tests (since it's a CLI tool that doesn't export a python API, cram lets me write a bunch of CLI/bash tests in Markdown/doctest style.) 

  4. When I promoted it from "the stuff in my python/exif directory" to an Actual Project, it needed a name - Flickor is the Swedish word for "girls", and "boys" is Pojkar (pronounced poy-car.) 

  5. Originally this was twitter support, then I added mastodon support, then twitter killed their registered-but-non-paying API use so I dropped the twitter support - which let me increase the post size significantly. This also simplified the code - I previously used bits of thok-ztwitgw but now I can just shell out to toot. 

  6. Notes actually went away, then came back, then got ACLed; they're also inconsistent: if you're in a search result or range of pictures (such as you get from clicking an image on someone's user page) the mouse only zooms and pans the image; if you edit the URL so it's just a single-image page, then you get rectangle-select back. I basically no longer use the feature and should probably do it directly client-side at some point, at which point the replacement tool should get described here. 

  7. It may be possible to pick a consistent output style at rendering time, but that might not be consistent with future KPhotoAlbum versions, and I just wanted to stick with something that worked reliably with the current output without doing too much (potentially pointless) futureproofing. 

  8. One subset was leftover trailcam pics from before I nailed down my trailcam workflow - most trailcam pics are discardable, false-positive triggers of the motion sensor due to wind - but initially I'd imported them into KPhotoAlbum first, and then deleted the discarded pictures - and this left dangling entries in index.xml that had no pictures, and left blank spots in the UI so I couldn't tag them even if I wanted to. 

  9. This is basically an easier version of the list-merge problem we used to ask as a MetaCarta interview question - because we actually did have a "combine multiple ranked search results" pass in our code that needed to be really efficient and it was a surprisingly relevant question - which is rare for "algorithm questions" in interviews. 

  10. In fact, it would have made a lot of sense to do this as a set of emacs macros, except that I didn't want to tackle the date parsing in elisp (and pymacs is years-dead.) 

  11. perhaps instead of pouring all of the attributes and tags into sqlite as a cache, I should instead be using it for an index that points back into the XML file, so I can do fast inserts as well as extracts? This will need a thorough test suite, and possibly an incremental backup system for the index to allow reconstruction to recover from design flaws. 

This was supposed to be a discussion of a handful of scripts that I wrote while searching for some particular long lost images... but the tale of quest/rathole itself "got away from me". The more mundane (and admittedly more interesting/relevant) part of the story will end up in a follow-on article.

Background

While poking at an SEO issue for my ice cream blog1 I noticed an oddity: a picture of a huge soft-serve cone on flickr that wasn't in my KPhotoAlbum archive. I've put a bunch of work into folding everything2 in to KPhotoAlbum, primarily because the XML format it uses is portable3 and straightforward4 to work with.

Since I wanted to use that picture in my KPhotoAlbum-centered ice cream blog5 I certainly could have just re-downloaded the picture, but one picture missing implied others (I eventually found 80 or so) and so I went down the rathole to solve this once and for all.

First Hints

The picture on flickr has some interesting details to work from:

  • A posting date of 2005-07-31 (which led me to some contemporary photos that I did have in my archive)
  • Tags for nokia6630 and lifeblog
  • A handwritten title (normally my uploads have a title that is just the on-camera filename, because they go via a laptop into KPhotoAlbum first, where I tag them for upload.)

As described in the Cindy's Drive-in story, this was enough to narrow it down to a post via the "Nokia Lifeblog Multimedia Diary" service, where I could take a picture from my Nokia 6630 phone, T9-type a short description, and have it get pushed directly to Flickr, with some automated tags and very primitive geolocation6. That was enough to convince me that there really was an entire category of missing pictures, but that it was confined to the Nokia 6630, and a relatively narrow window of time - one when I was driving around New England in my new Mini Cooper Convertible and taking lots of geolocated7 pictures.

Brute Force

I'd recently completed (mostly) a transition of my personal data hoard from a collection of homelab OpenAFS servers (2 primary machines with 8 large spinning-rust disks) to a single AsusStor device with a half dozen SSDs, which meant that this was a good chance to test out just how much of a difference this particular technology step function made - so I simply ran find -ls on the whole disk looking for any file from that day8:

$ time find /archive/ -ls 2>/dev/null |grep 'Jul 30  2005'

The first time through took five minutes and produced a little over a thousand files. Turns out this found things like a Safari cache from that day, dpkg metadata from a particular machine, mailing list archives from a few dozen lists that had posts on that exact day... and, entirely coincidentally, the last two files were in a nokia/sdb1/Images directory, and one of them was definitely the picture I wanted. (We'll get to the other one shortly.)

Since that worked so well, I figured I'd double check and see if there were any other places I had a copy of that file - as part of an interview question9 over a decade ago, I'd looked at the stats of my photo gallery and realized that image sizes (for JPGs) have surprisingly few duplicates, so I did a quick pass on size:

time find /archive -size 482597c -ls

Because I was searching the same 12 million files10 on a machine with 16G of RAM and very little competing use, this follow-up search took less than two minutes - all of the file metadata was (presumably) still in cache. This also turned up two copies - the one from the first pass, and one from what seems to be a flickr backup done with a Mac tool called "Bulkr"11 some time in 2010 (which didn't preserve flickr upload times, so it hadn't turned up in the first scan.) Having multiple copies was comforting, but it didn't include any additional metadata, so I went with the version that was clearly directly backed up from the memory of the Nokia phone itself.

That other file (side quest)

So I found 482597 Jul 30 2005 /archive/.../nokia/sdb1/Images/20050730.jpg and 3092 Jul 30 2005 /archive/.../nokia/sdb1/Images/_PAlbTN/20050730.jpg in that first pass. The 480k version was "obviously" big enough, and rendered fine; file reported the entirely sensible JPEG image data, Exif standard: [TIFF image data, little-endian, direntries=8, manufacturer=Nokia, model=6630, orientation=upper-left, xresolution=122, yresolution=130, resolutionunit=2], baseline, precision 8, 1280x960, components 3 which again looks like a normal-sized camera image. The 3k _PAlbTN/20050730.jpg version was some sort of scrap, right?12

I don't know what they looked like back then, but today the description said Psion Series 5 multi-bitmap image which suggested it was some kind of image, and that triggered my "I need to preserve this somehow" instinct13.

Wait, Psion? This is a Nokia... turns out that Psion created Symbian, pivoting to being "Symbian Ltd" - a multi-platform embedded OS running on a variety of phones and PDAs - until it got bought out by Nokia. So "Psion" is probably more historically accurate here.

The format is also called EPOC_MBM in the data preservation space, and looking at documentation from the author of psiconv it turns out that it's a container format for a variety of different formats - spreadsheets, notes, password stores - and for our purposes, "Paint Data". In theory I could have picked up psiconv itself - the upstream Subversion sources haven't been touched since 2014, but they do contain Debian packaging, so it's probably a relatively small "sub-rathole"14... but the files just aren't that big and the format information is pretty clear, so I figured I'd go down the "convert english to python" path instead. It helps that I only need to handle small images, generated from a very narrow range of software releases (Nokia phones did get software updates, but not that many, and it was only a couple of years) so I could probably thread a fairly narrow path through the spec - and it wouldn't be hard to keep track of the small number of bytes involved at the hexdump level.

Vintage File Formats

The mechanically important part of the format is that the outer layers of metadata are 32 bit little endian unsigned integers, which are either identifiers, file offsets, or lengths. For identifiers, we have the added complexity that the documentation lists them as hex values directly, and to remove a manual reformatting step we want a helper function that takes "37 00 00 10" and interprets it correctly. So, we read the files with unpack("<L", stream.read(4))[0], and interpret the hex strings with int("".join(reversed(letters.split())), 16) which allows directly checking and skipping identifiers with statements like assert getL(...) == h2i("37 00 00 10")15. This is also a place where the fact that we're only doing thumbnail images helps - we have a consistent Header Section Layout tag, the same File Kind and Application ID each time, and that meant a constant Header Checksum - so we could confirm the checksum without ever actually calculating it.
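
Written out, those two helpers are tiny (getL and h2i are the names used in the assert above):

    from struct import unpack

    def getL(stream):
        """Read one 32-bit little-endian unsigned integer."""
        return unpack("<L", stream.read(4))[0]

    def h2i(letters):
        """Turn the docs' byte listing "37 00 00 10" into the int 0x10000037."""
        return int("".join(reversed(letters.split())), 16)

    # usage pattern - check and skip an expected identifier:
    # assert getL(f) == h2i("37 00 00 10")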

Once we get past the header, we have the address of the Section Table Section16 which just points near the end of the current file - where we find a length of "1 entry" and a single pointer back to where we already were. (All this jumping around feels like a lot of overhead, but it's only about one percent of the file size.) That pointer brings us to the Paint Data Section, which starts with a length (which helps us "account for" the other bytes in the file, since it covers everything up to the Section Table) and an offset (which we can ignore since the subsequent data just stacks up until we get to the pixels.) Finally we get the x and y pixel dimensions, some theoretical physical dimensions (specified as having units of ¹/₁₄₄₀ of an inch, but always zero in my actual files) and then a "bits per dot" and "color vs greyscale" flag. Given that these are photo thumbnails, it isn't surprising that these are consistent at "16 bits per pixel" and "color", but the spec is vague about that (as is the psiconv code itself, which just does some rounded fractional values for bit sizes that are larger than the 1/2/4 bit "magic lookup table" values.)
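
As a sketch, the pointer-chasing transcribes almost directly - this reuses getL from the previous snippet, takes the position of the section-table address as a parameter rather than asserting anything about the header size, and assumes every named field is a single 32-bit word (which held for the thumbnails I cared about):

    def read_paint_header(f, section_table_field_offset):
        f.seek(section_table_field_offset)  # where the header keeps the address
        f.seek(getL(f))                     # -> Section Table Section, near EOF
        assert getL(f) == 1                 # one entry...
        f.seek(getL(f))                     # ...pointing back near the start
        _length = getL(f)                   # accounts for bytes up to the table
        _offset = getL(f)                   # ignorable: data just stacks up
        x, y = getL(f), getL(f)             # pixel dimensions (42x36 here)
        _px, _py = getL(f), getL(f)         # physical size, 1/1440 inch; zero
        bits = getL(f)                      # "bits per dot": always 16 here
        color = getL(f)                     # color-vs-greyscale flag
        encoding = getL(f)                  # 0 = plain data, 3 = 16-bit RLE
        return x, y, bits, color, encoding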

Finally we get to an encoding flag. On the first pass through I only saw 0 ("Plain Data") for this, which simplified things... until I did the full run and found that many of the chronologically later thumbnails17 instead had 3, meaning "16-bit RLE". The particular RLE mechanism is pretty simple: values below 128 are a repeat count N, and the following pixel should be "used" N+1 times; in order to avoid the RLE making highly varying files larger, values from 128 to 255 do the reverse: the subsequent 256-N 16-bit pixels18 are just used directly with no expansion.
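
The decoder for that scheme is correspondingly small (a sketch; unrle16 is my name for it, not the spec's):

    from struct import unpack

    def unrle16(data, npixels):
        pixels, i = [], 0
        while len(pixels) < npixels:
            count = data[i]; i += 1
            if count < 0x80:
                # repeat: the next pixel is used count+1 times
                (pixel,) = unpack("<H", data[i:i + 2]); i += 2
                pixels.extend([pixel] * (count + 1))
            else:
                # literal run: the next (256 - count) pixels, unexpanded
                literal = 0x100 - count
                pixels.extend(unpack("<%dH" % literal, data[i:i + 2 * literal]))
                i += 2 * literal
        return pixels[:npixels]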

Ancient Pixels

While pixels are clearly labeled as 16 bit, we don't actually have any hints about which of those bits represent which colors. I tried a bunch of guesses that (with a couple of test images) were either too pink, too yellow, too magenta, or all of them at once. Finally I looked at the psiconv source - lib/psiconv/parse_image.c doesn't appear to directly handle 16 bit, it just has a fallback heuristic where red and green each get (16+2)/3 bits, and blue gets the rest, so you get 6/6/4 (which was one of the values I'd already guessed and discarded as "too pink".) To make sure it wasn't a more complicated misinterpretation, I just grabbed the upper 8 bits and used them for all three channels - for a snowy scene with a lot of white and black anyway, it looked pretty convincing, even if it was really just dumping everything but red (displaying it in monochrome probably made it easier to reinterpret, though.)

I also tried a few sample images that were also in the phone backup - flower.jpg was mostly yellow, blue.gif was shades of blue with white swirls - and still wasn't getting that far. At some point I realized that this was a kind of retrocomputing project and that perhaps I should be trying to figure out what "period" 16 bit pixel representations were - and wikipedia already had the answer! While there was a lot of "creativity" in smaller encodings, "RGB565" was basically it for 16 bit19. Since I'd already parameterized the bit lengths for the previous experiments, just dropping in rgbrange = [5, 6, 5] was enough to produce samples with convincing colors when compared to the original images (the unpacking is sketched below). Victory! Now all I had to do was process the whole set. A little use of python3-magic20 let me identify which files were in this format, then convert the whole set.
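
The parameterized rgbrange version reduces to plain RGB565 unpacking - red in the high bits, per the standard layout:

    def rgb565(pixel):
        # 5 bits red, 6 bits green, 5 bits blue
        r = (pixel >> 11) & 0x1f
        g = (pixel >> 5) & 0x3f
        b = pixel & 0x1f
        # scale each channel up to 0..255
        return (r * 255 // 31, g * 255 // 63, b * 255 // 31)

    # e.g. with PIL: Image.new("RGB", (x, y)).putdata([rgb565(p) for p in pixels])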

Great, now I have all of these thumbnails. And as thumbnails they look pretty good! On closer review they even match the full-sized images I'd already recovered, which confirms that nothing else is missing from that particular camera phone. The other thing that really stands out from that review is that these really are only 42x36 and that is tiny, and if you enlarge them at all they actually get significantly worse. Now that I've used them to be sure that I have all of the originals: I've deleted all of the _PAlbTN directories from my photogallery.21

Conclusion

This was a fairly deep (even excessively deep) rathole for this class of problem - and there are different branches I would have taken if I were doing this in a professional context - but it resolves some (personal) questions that have been lingering for over a decade, and gives me some increased confidence in the integrity of my lifetime photo archive. Worth it.


  1. I mentioned the blog to some old friends who asked "can I just google ice cream blog eichin and find it?" and at the time, I assumed that would work - not knowing that Alfred Eichin patented an ice cream scoop in 1954 that dominates the web, partly because his name was engraved on many of them and they turn up on collector sites, etsy, and ebay. (Not a relation, as far as I am aware.) 

  2. I've folded previous photogalleries in, with tag and description conversions (even if that meant a lot of cut&paste), and included even terrible digital photos all the way back to the little 640x480 shots from my 1999-era Largan camera. 

  3. I've published tools like kpa-grep and also built personal cropping tools (that used the old flickr region-note feature) and auto-posting tools (that generate my current social media posts as well.) All of these work directly with the KPhotoAlbum XML format, typically using python lxml. 

  4. You've probably heard horrors about XML; while there are encoding issues (well handled by popular libraries - if you don't try and use regex you won't summon ZA̡͊͠͝LGΌ) the thing that matters here is that the model is very flat: a long list of images with a fixed vocabulary of attributes and a single list of (sets of) tags per image - no nesting, no CDATA, no entity cross-reference. 

  5. I literally run icecream-start shopname to grab all of the images tagged (with KPhotoAlbum Location tags) with that shop's name and assemble a first-draft markdown page that just assumes I want all of the pictures and will fill in text descriptions myself. 

  6. Originally the tags were just the real-time cell-tower ids, with a service that scanned participating flickr accounts and turned the "machine" tags into real-world locations afterwards. 

  7. I worked at MetaCarta - a geographic search company - at this time, so I had a professional interest, but we weren't actually acquired by Nokia until 5 years later. 

  8. Seems a bit crude, but the alternative is using touch to create two timestamp files and use -newer; I did run a quick test pass to catch the extra whitespace between the year and the day-of-month - since I also didn't want to turn this into another #awktober post. 

  9. The interview question was about cleaning up duplicates in a large but badly merged photogallery. The particular bit we were looking for was that you didn't need to do N² full-file comparisons on a terabyte of images when there were only 20k files involved; if you started with just comparing sizes, that was good, but we'd push a little harder and steer you towards comparing hashes in various ways. All straightforward stuff analogous to the kind of bulk data shuffling we were doing, without needing proprietary concepts like gazetteer imports... and most people had some concepts of digital photography at that point. The bit about sizes was realizing that if you shot "raw" most files would be the same uncompressed size, but JPGs are highly compressed and turned out to vary a lot - so as long as you did a full-file confirmation on each pair, using length as an initial discriminator was actually pretty good. (But really, you know about hashes, md5sum, that sort of thing, right? Especially for an infrastructure job where you've almost certainly downloaded a linux install ISO and checked the hashes?) 

  10. Since all of the archives involved are on one filesystem, I didn't need a filesystem cache to get this instantly - df -i reports IUsed and all of those correspond to what I was searching through, with little (and probably no) disk access at all. 

  11. As far as I can tell, bulkr only pulled down the "Original" images and named them from the flickr title, but didn't grab tags, comments, or geographic location. Fortunately that is still up on flickr for future preservation efforts. 

  12. I only finally got around to looking this up while writing this, turns out the internet believes that this is actually an abbreviation of Photo Album Thumb Nail - which is at least convincing, if not well documented. 

  13. Also, there were a number of these Psion "images" in my collection already - which KPhotoAlbum failed to render at all, just left unselectable blanks in the image view - which implied that if I did follow this thread to the end it would let me solve yet another archive quality issue... 

  14. If this were a work project, I'd have gone down the "update the package" path - mostly because at both MetaCarta and RightHand I had already built entire systems of plumbing to streamline the "build a package from upstream sources adding small rigorously tracked changes, and stuff it into a shared artifact repository" pipeline; I only have segments of that implemented in my homelab. 

  15. The actual code has more comments and variables-for-the-purpose-of-labelling because as I built it up I wanted to be clear on things like "I expect this to be a Header Section Layout but I got something else"; the documentation was clear enough (and the format simple enough) that there weren't that many experimental failures in the early stages, and by the time I got to the later stages where it would have been helpful I had already relaxed to the point of writing incomprehensible lines like seek(thing_offset) anyway. 

  16. Both the names and the indirection levels involved strongly suggest that whoever cooked up this format had been recently exposed to the ELF spec, with its Section Header Table and Program Header Table, and in fact Symbian E32Image turns out to be ELF. 

  17. My evidence-free theory here is that while phones of that era didn't get software updates very often, I do vaguely remember getting a few, so perhaps RLE support simply wasn't there as-shipped and was delivered as part of a later update, so only later images used it. 

  18. This was my only point of confusion from the documentation: it says "100-marker" in a context surrounded by other "obviously" hex numbers (with no 0x marker) and for some reason I missed that and interpreted 100 as decimal, which led to rather scrambled decoding until I checked the psiconv code itself - up until that point I'd actually done fairly well at implementing this by only looking at the specs, and I really can't blame the spec author for this one. 

  19. RGB565 was also known as "High Color" in Windows documentation of the era. (That page explains nominal human eyes being more green-sensitive and includes a sample image that attempts to justify that "the extra bit should be in green".) 

  20. "magic" refers to the magic number database used by the unix file utility to make a "heuristic but surprisingly good" fast guess as to what the contents of a file are (ignoring the name - remember, these Psion files all had .jpg or .gif extensions anyway, the directory name mattered but otherwise each thumbnail had exactly the same name as the image it was made from.) 

  21. I did keep them in the git repo for the conversion project - 400ish original thumbnails takes up 2M bytes, and they compress down to about half a meg - so there's no need to free up the space they take up, but there are good organizational reasons like "the photogallery should only have original images" to purge them from the gallery itself. This ends up guiding other clean-up and curation later on.