Tooling for working with KPhotoAlbum galleries
KPhotoAlbum has lots of built-in
features, but in practice the more convenient[1] way to interface
with it from simple unix tools is to just operate on the index.xml
where all of the metadata is stored.[2]
kpa-grep
I started
kpa-grep
back in
2011, around when I hit 90k pictures (I'm over 200k now.) The
originally documented use case was kpa-grep --since "last week"
--tags "office"
which was probably for sorting work pictures out from
personal ones. (The fuzzy timedateparser
use was there from day
one; since then, I'm not sure I've used anything other than "last
week" or "last month", especially since I never implemented date
ranges.) I've worked on it in bursts; usually there's feedback
between trying to do something with a sub-gallery, trying to script
it, and then enhancing kpa-grep
to handle it. The most recent burst
added two features, primarily inspired by the
tooling around my Ice Cream Blog:
- A sqlite-based cache of the XML file. Back in the day it took 6-8 seconds to parse the file; on a modern laptop with SSD and All The RAM it's more like 2.5 seconds. The sqlite processing takes a little longer than that, but subsequent queries are near-instant, which makes it sensible to loop over kpa-grep output and do more kpa-grep processing on it. A typical "pictures are ready, create a dummy review post for the last ice cream shop with all pictures and some metadata" operation was over a minute without the cache, and is now typically 5-10 seconds even with a stale cache.
- Better tag support - mostly fleshing out unimplemented combinations of options, but in particular allowing --tag and --since to filter --dump-tags, which let me pick out the most recent Locations which are tagged ice cream, filter out city names, and have a short list of ice cream shops to work with. (Coming soon: adding some explicit checks of them against which shops I've actually reviewed already.)
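As a rough illustration of the cache idea (not the actual kpa-grep code), here is a minimal sketch that rebuilds a sqlite file only when index.xml has changed, and answers tag queries from it. Everything about it is an assumption for the example: the table layout, the function names refresh_cache and files_tagged, the use of the file's mtime as the staleness check, the stdlib xml.etree.ElementTree parser standing in for lxml, and an uncompressed index layout where each image carries options/option/value children.

```python
import os
import sqlite3
import xml.etree.ElementTree as ET


def refresh_cache(index_path, cache_path):
    """Return a connection to the cache, rebuilding it if index.xml changed."""
    mtime = os.stat(index_path).st_mtime_ns
    db = sqlite3.connect(cache_path)
    db.execute("CREATE TABLE IF NOT EXISTS meta (mtime INTEGER)")
    row = db.execute("SELECT mtime FROM meta").fetchone()
    if row is not None and row[0] == mtime:
        return db  # cache matches the index; skip the slow XML parse

    # Stale or empty cache: throw it away and reload everything.
    db.executescript("""
        DROP TABLE IF EXISTS images;
        DROP TABLE IF EXISTS tags;
        DELETE FROM meta;
        CREATE TABLE images (file TEXT PRIMARY KEY, startDate TEXT);
        CREATE TABLE tags (file TEXT, category TEXT, value TEXT);
    """)
    for image in ET.parse(index_path).getroot().iter("image"):
        name = image.get("file")
        db.execute("INSERT INTO images VALUES (?, ?)",
                   (name, image.get("startDate")))
        # Assumed layout: <options><option name=CATEGORY><value value=TAG/>
        for option in image.iterfind("options/option"):
            for value in option.iterfind("value"):
                db.execute("INSERT INTO tags VALUES (?, ?, ?)",
                           (name, option.get("name"), value.get("value")))
    db.execute("INSERT INTO meta VALUES (?)", (mtime,))
    db.commit()
    return db


def files_tagged(db, category, value):
    """List image paths carrying the given tag, oldest first."""
    rows = db.execute(
        "SELECT i.file FROM images i JOIN tags t ON t.file = i.file "
        "WHERE t.category = ? AND t.value = ? ORDER BY i.startDate",
        (category, value))
    return [row[0] for row in rows]
```

The point of the split is that only refresh_cache ever pays for the XML parse; a shell loop that calls the query side repeatedly hits sqlite alone.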
As far as I know I don't have any users, but nonetheless it is on
github, so I've put some effort into keeping it clean[3]; recently that's
also included coming up with a low-effort workflow for doing releases
and release artifacts. This is currently a shell script involving
debspawn build, dpkg-parsechangelog, and gh release upload, which
feels like an acceptable amount of effort for a single program with a
man page.
pojkar
pojkar is a collection of Flickr upload tools that work
off of KPhotoAlbum.[4] The currently active tools are
sync-to-flickr and auto-cropr.
sync-to-flickr
sync-to-flickr is the engine behind a simple workflow: when I'm
reviewing photos in KPhotoAlbum, I choose particular images for
posting by adding the Keyword tag flickr to the image. Once I've
completed a set and quit out of KPhotoAlbum, I run sync-to-flickr sync,
which looks for everything tagged flickr and uploads it to Flickr
with a title, description, and rotation (and possibly map coordinates,
except there are none of those in the current gallery.) There's also
a retry mechanism (both Flickr's network and mine have improved in the
last decade so this rarely triggers.) Once a picture has been
uploaded, a flickd tag is added to it, so future runs know to skip
it.
After all of that, the app collects up the tags for the posted set of
pictures; since social media posting[5] has length limits (and since
humans are reading the list) we favor longer names and names that
appear more often in the set; then we drop tags that are substrings
of other tags (dropping Concord in favor of Concord Conservation Land
since the latter implies the former well enough.) Finally we
truncate the list to fit in a post.
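A minimal sketch of that selection pass, under stated assumptions: the name pick_tags, the ", " separator, and the exact ranking (frequency first, length as the tie-breaker) are guesses for illustration, since the text only says both longer and more frequent names are favored.

```python
from collections import Counter


def pick_tags(tag_lists, limit):
    """Choose tags for a post from the per-picture tag lists.

    tag_lists: one list of tag names per uploaded picture.
    limit: maximum length of the joined ", "-separated tag string.
    """
    counts = Counter(tag for tags in tag_lists for tag in tags)

    # Drop any tag that is a substring of another tag in the set.
    kept = [tag for tag in counts
            if not any(tag != other and tag in other for other in counts)]

    # Assumed ranking: most frequent first, then longest, then alphabetical.
    kept.sort(key=lambda tag: (-counts[tag], -len(tag), tag))

    # Truncate: stop at the first tag that would overflow the post.
    chosen, used = [], 0
    for tag in kept:
        cost = len(tag) + (2 if chosen else 0)  # 2 for the ", " separator
        if used + cost > limit:
            break
        chosen.append(tag)
        used += cost
    return chosen
```

For example, with Concord and Concord Conservation Land both present, only the longer tag survives the substring pass, whatever the limit is.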
auto-cropr
Flickr has an obscure[6] feature where you could select a rectangle on
a picture (in the web interface) and add a "note" to that region.
auto-cropr
used the API to look for recent instances of that which
contained a magic string - then picked up the geometry of the
rectangle and cropped just that area, posting it as a new flickr
picture - and then cross linking them, replacing the original comment
with a link to the new image. Basically this let you draw the
viewer's attention to a particular area and then let them click to
zoom in on it and get more commentary as well as a "closeup".
Note that these "views" are only on Flickr; I don't download or back them up at all (I should fix that.)
fix-kpa-missing/kpa-insert
As part of the Nokia 6630 image fixing project
there ended up being a couple of different cleanups which I needed to
review carefully, so I wanted the tools to produce diffable changes, which
lxml doesn't really guarantee[7]. Currently, the XML written
out by KPhotoAlbum is pretty structured - in particular, any image
with no tags is a one-line <image ... /> and I was particularly
looking to make corrections to things that were fundamentally
untagged[8]/untaggable (for fix-kpa-missing) or insert lines
that were already one-line-per-picture (for kpa-insert); I just had to
get them in the right place.
When I started the image recovery, I ended up just adding a bunch of
images with their original datestamps (from 2005), but KPhotoAlbum
just added them to the end of the index (since they were "new" and I
don't use the sorting features.) Since that meant I already had the
correct lines for each image (with checksums and dimensions), I could
just chop them out of the file. Then kpa-insert takes these lines, and
walks through the main index as well. For basically any line that
doesn't begin with <image it just copies it through unchanged to the new
index; when it finds an image line, it grabs the attributes
(startDate, md5sum, and pathname specifically) and then checks
them against the current head of the insertion list[9].
Basically, if the head of the index is newer than the head of the
insertions, copy insertions over until that's no longer true. If they
match exactly - the original version just bailed so I could look at
them, then once I figured out that they really were duplicates, I
changed it to output rm
commands for the redundant files (and only
kept the "more original" line from the original index.)
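The walk described above is essentially a two-list merge. Here is a minimal sketch of it in the "direct but brittle" spirit, mirroring the first version that simply bailed on duplicates. The function names, the regex-based attribute parsing, the attribute name file for the pathname, and the assumption that startDate values are ISO-style strings that sort correctly as text are all illustrative guesses, not the real kpa-insert.

```python
import re

# Matches name="value" pairs on a one-line <image ... /> entry.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def is_image(line):
    # The trailing space matters: "<images>" must not count as an image line.
    return line.lstrip().startswith("<image ")


def image_key(line):
    """(startDate, md5sum, file) for an image line; KeyError if one is missing."""
    attrs = dict(ATTR.findall(line))
    return attrs["startDate"], attrs["md5sum"], attrs["file"]


def merge(index_lines, insert_lines):
    """Copy index_lines through, slotting each insertion in by startDate."""
    pending = sorted(insert_lines, key=image_key)
    out = []
    for line in index_lines:
        if is_image(line):
            key = image_key(line)
            # Emit every insertion older than the image we are about to copy.
            while pending and image_key(pending[0])[0] < key[0]:
                out.append(pending.pop(0))
            if pending and image_key(pending[0]) == key:
                raise ValueError(f"exact duplicate, go look at it: {key}")
        out.append(line)
    if pending:
        raise ValueError("insertions left over that are newer than the index")
    return out
```

Because non-image lines are copied through untouched, a diff of the old and new index shows only the inserted lines, which is what makes the result reviewable by eye.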
The output was a diffable replacement index that I could review, and
check that the "neighbor" <image>
entries made sense, and that
nothing was getting added from elsewhere in the tree, and other basic
"eyeball" checks. Since I had to do this review anyway to make sure
I hadn't made any mistakes of intent, it made sense to write the
code in a "direct but brittle" style - anything weird, just bail with
a good traceback; I wouldn't even look at the diffs until the code
"didn't find anything weird." That also meant that I'd done the least
amount of work[10] necessary to get the right result - basically a
degenerate case of Test Driven Development, where there's one input
(my existing index) and one test (does the new index look right.)
I also didn't have any of my usual user interface concerns - no one (not even me) was ever going to run this code after making this one change. I did keep things relatively clean with small helper functions because I expected to mine it for snippets for later problems in the same space - which I did, almost immediately.
For fix-kpa-missing, I'd noticed some "dead space" in the main
KPhotoAlbum thumbnail view, and figured that it was mostly the result
of an old trailcam[8] project. I was nervous about "losing"
metadata that might point me at images I should instead be trying to
recover, but here was a subset that I knew really were (improperly but
correctly) discarded images - wouldn't it be nice to determine that
they were the only missing images and clean it up once and for all?
So, the same "only look at <image lines" code from kpa-insert,
extract the pathname from the attributes, and just check if the file
exists; I could look for substrings of the pathname to determine that
it was a trailcam pic and was "OK", plus I could continue with the
"direct but brittle" approach and check that each stanza I was
removing didn't have any tags/options - but just blow up if it found
them. Since it found none, I knew that
- I had definitely not (mis-)tagged any of the discarded pictures
- I didn't have to write the options-handling code at all. (I suspect I will eventually need this, but the tools that are likely to need it will have other architectural differences, so it makes sense to hold off for now.)
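Roughly, that check-and-drop pass could look like the sketch below. It is an illustration only: the name drop_missing, the ok_substring parameter, and the attribute name file are assumptions, and "has no tags" is approximated by the entry being a self-closing one-line <image ... />.

```python
import os
import re

FILE_ATTR = re.compile(r'\bfile="([^"]*)"')


def drop_missing(index_lines, root, ok_substring):
    """Copy index_lines, dropping one-line entries whose file is gone.

    Blows up rather than guessing if a missing file is not from the
    expected subset, or if its entry is not a self-closing one-liner
    (which would mean it has tags/options to handle).
    """
    out = []
    for line in index_lines:
        if line.lstrip().startswith("<image "):
            path = FILE_ATTR.search(line).group(1)
            if not os.path.exists(os.path.join(root, path)):
                if ok_substring not in path:
                    raise ValueError(f"unexpected missing image: {path}")
                if not line.rstrip().endswith("/>"):
                    raise ValueError(f"missing image has options: {path}")
                continue  # known-discarded picture: leave it out
        out.append(line)
    return out
```

A clean run is the evidence: if neither exception fires, every missing file was from the expected subset and none of them carried tags.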
There were a couple of additional scripts cobbled up out of these bits:
- fix-kpa-PAlbTN, which looked for Photo Album ThumbNails from the Nokia project and made sure they didn't exist anywhere else in the tree, since I was discarding the ones that I had real pictures for and wanted to be sure I'd really finished up all of the related work while I still had Psion 5 code in my head...
- find-mbm, which used magic.from_file to identify all of the Psion Series 5 multi-bitmap image files (expensively, until the second or third pass when I realized that I had all the evidence I needed that they only existed in _PAlbTN subdirectories, and could just edit the script to do a cheap path test first - effectively running file on a couple of hundred files instead of two hundred thousand.) This was just to generate filenames for the conversion script; it didn't do any of the work directly.
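The "cheap path test first" change can be sketched like this. The function name find_mbm and the identify parameter are hypothetical: identify stands in for whatever does the expensive content sniffing (magic.from_file in the text above), and is passed in so the sketch does not depend on a particular library.

```python
import os

WANTED = "Psion Series 5 multi-bitmap image"


def find_mbm(root, identify, marker="_PAlbTN"):
    """Return paths under root that identify() reports as Psion bitmaps.

    identify: callable taking a path and returning a description string.
    Only files somewhere below a directory named `marker` are identified;
    everything else is skipped on the path test alone.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if marker not in dirpath.split(os.sep):
            continue  # cheap test: not under a thumbnail directory
        for name in filenames:
            path = os.path.join(dirpath, name)
            if WANTED in identify(path):  # expensive test, now rarely run
                hits.append(path)
    return sorted(hits)
```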
Conclusion
I now have three entirely different sets of tooling to handle
index.xml that take very different approaches:
- kpa-grep uses SQL queries on a sqlite cache of the entire index (read-only, and generates it by LXML-parsing the whole file if it's out of date)
- pojkar does direct LXML parsing and rewriting (since it's used for uploads that used to be expensive, it does one parse up front and then operates on an internal tree, writing that out every time an upload succeeds for consistency/checkpointing)
- kpa-insert &c. treat the index.xml as a very structured text file - and operate efficiently but not very safely, relying on my reading the diffs to confirm that the ad-hoc tools worked correctly regardless of not being proper.
Fortunately I've done all of the data-cleaning I intend to do for
now, and the kpa-grep issue list is short
and mostly releng, not features. I do eventually want a full suite of
"manipulate images and tags" CLI tools, and I want them to be faster
than 2.5s per operation[11] - but I don't have a driving project
that needs them yet - my photoblogging tools are already Fast
Enough™.
[1] "Ergonomic" might be a better word than convenient, but I have a hard time saying that about XML.

[2] This does require discipline about only using the tools when KPhotoAlbum itself isn't running, but that's not too big a deal for a personal database - and it's more about not doing updates in two places; it's "one program wins", not a file locking/corruption problem.

[3] Most of the cleanliness is personal style, but lintian and pylint are part of that. This covers having a man page (using ronn to let me write them in Markdown) and tests (since it's a CLI tool that doesn't export a python API, cram lets me write a bunch of CLI/bash tests in Markdown/doctest style.)

[4] When I promoted it from "the stuff in my python/exif directory" to an Actual Project, it needed a name - Flickor is the Swedish word for "girls", and "boys" is Pojkar (pronounced poy-car.)

[5] Originally this was twitter support, then I added mastodon support, then twitter killed their registered-but-non-paying API use so I dropped the twitter support - which let me increase the post size significantly. This also simplified the code - I previously used bits of thok-ztwitgw but now I can just shell out to toot.

[6] Notes actually went away, then came back, then got ACLed; they're also inconsistent: if you're in a search result or range of pictures (such as you get from clicking an image on someone's user page) the mouse only zooms and pans the image; if you edit the URL so it's just a single-image page, then you get rectangle-select back. I basically no longer use the feature and should probably do it directly client-side at some point, at which point the replacement tool should get described here.

[7] It may be possible to pick a consistent output style at rendering time, but that might not be consistent with future KPhotoAlbum versions, and I just wanted to stick with something that worked reliably with the current output without doing too much (potentially pointless) futureproofing.

[8] One subset was leftover trailcam pics from before I nailed down my trailcam workflow - most trailcam pics are discardable, false-positive triggers of the motion sensor due to wind - but initially I'd imported them into KPhotoAlbum first, and then deleted the discarded pictures - and this left dangling entries in index.xml that had no pictures, and left blank spots in the UI so I couldn't tag them even if I wanted to.

[9] This is basically an easier version of the list-merge problem we used to ask as a MetaCarta interview question - because we actually did have a "combine multiple ranked search results" pass in our code that needed to be really efficient and it was a surprisingly relevant question - which is rare for "algorithm questions" in interviews.

[10] In fact, it would have made a lot of sense to do this as a set of emacs macros, except that I didn't want to tackle the date parsing in elisp (and pymacs is years-dead.)

[11] Perhaps instead of pouring all of the attributes and tags into sqlite as a cache, I should instead be using it for an index that points back into the XML file, so I can do fast inserts as well as extracts? This will need a thorough test suite, and possibly an incremental backup system for the index to allow reconstruction to recover from design flaws.