Log in

No account? Create an account
Previous Entry Share Next Entry
For this, I have a computer science degree
[This weird ramble is kind of about programming, but this time it's introductory-level and generally useful, instead of the wizard-level stuff I usually talk about. We are going to teach a few basic programming concepts using my comic book collection as a motivating real-world example.]

There -- that project's been underway for about three years, and at least it's in decent shape before things have to go into storage.

I occasionally refer to Steve, the proprietor of The Outer Limits in Waltham, as my Comic Book Pusher. Some people think I'm kidding, but it's more a matter than I decided many years ago that comics are a less dangerous (if not necessarily cheaper) addiction than cocaine.

But the thing is, I am very much *not* a comic book "collector". I don't buy for value, or even completeness -- I just like reading them. So I tear through huge numbers of comics, and then set them aside. And once in a while, it occurs to me to sort them as I put them into boxes and stick them in the basement, so that later I might be able to re-read the best ones.

The problem is, while sorting a *few* comics (say, a few hundred) is easy, sorting many thousands of them is not. Hence, the last time I did a full sort of the entire pile was rather a while ago. "A while ago" being defined, in this case, as 1990. This is a Problem.

This is where the computer science degree comes in. The most important thing you learn in programming classes isn't how to program -- that becomes obsolete rather quickly as systems and languages change -- but rather, how to analyze algorithms. And the very first thing you learn is how fast the various sort algorithms run.

Most people, when given a bunch of things to sort, use some kind of Insertion Sort -- that is, you just put things in order, sticking them into place as you get to them. This works great for sorting anywhere up to about 50 comics, but once you get past about a handspan in width it slows down dramatically, because it starts taking a long time to find the right place to put each one. And in fact, we are taught in school that an insertion sort is "n squared" -- the amount of time it takes is proportional to the number of things to sort, squared. When you're sorting tens of thousands of comic books, that is a very long time indeed. (Sorting ten thousand comics this way takes literally a million times as long as sorting ten.)

The canonical fastest sorting algorithm is known, quite reasonably, as the Quick Sort. Conceptually, it's pretty straightforward. You take your pile of things, and create two buckets around a midpoint: in our case, the buckets would be "comics from A-M" and "comics from N-Z". Then repeat this for each bucket, so you'd wind up with four buckets in order: A-F, G-M, N-S, T-Z. Keep repeating until each bucket has only one comic book and *poof* -- it's all sorted! This works *great* in the computer -- indeed, in some programming languages it's basically a one-liner -- and it is "n log n": the amount of time it takes to sort everything is proportional to the number of items times the logarithm of that number (which is relatively small). This is more or less as fast as you can theoretically go. It has only one problem: it requires n buckets, and I do not have tens of thousands of tiny comic book boxes.

So for real-world problems, we have the Merge Sort, which isn't *quite* as fast as Quick Sort, but is still basically "n log n". For a merge sort, you start out by doing an insertion sort (just putting things in order) for as many things as you can easily do (in our case, about a handspan's worth of comics). Set those aside, and do it again for the next batch; repeat until everything is in little piles. Now merge together a reasonable number of piles -- take around 3-6 piles of comics, and just go from front to back, putting them together, which is extremely quick and gets you a *bigger* pile. Do that for all the small piles, so you now have some big piles. Now do the same thing for the big piles, so you wind up with one really big, completely sorted pile.

(Of course, none of this is a deep dark secret -- good librarians know this technique just as well as good programmers. But it occurred to me that many folks probably don't have cause to sort thousands of items very often, and might not know the trick.)

So when I've read 3-6 months' of comics and put them away, I typically do about two passes of this, so that I wind up with 1-3 longboxes, sorted and merged in order. What I *haven't* done since 1990 is continuing the process: take these now rather large piles, and keep merging them.

But everything is going into storage shortly, which means that all hope of *ever* seeing the collection fully sorted is Doomed Doomed Doomed if I don't make progress. So a few months ago I kicked back into motion the project that I had started before Jane died, to fully merge the whole thing. Sadly, I didn't get it all the way complete, and I need to stop now and focus on packing. But I've gotten to the point where I now have three *humongous* piles of longboxes, labeled runs A, B and C, which represent all the comics since 1990. After the move is done, I can begin to pull those out of storage, along with the pre-1990 run, merge the whole thing, and craft a really serious Querki app to inventory and sell most of it. (My comics are going to be one of the acid tests for Querki, and will help drive several generally interesting features.)

And the final result? I have 68 longboxes in the post-1990 run, along with 30-40 in the pre-1990 one, already in storage. In total, I'd guess that that's about 30,000 comics, enough that the collection *must* be kept directly on the slab, lest it break the house. A fair addiction, indeed...

  • 1
Reminds me of the complex and arduous task of sorting CCG cards, which is further complicated by the fact that you want near-instant access to any given card *and* are constantly taking cards *out* of order.

What is the ordering relation?

Heh. *That* is why I can't delegate this part of the job. The sort is alphabetical by title -- but "title" is a pretty subtle problem, requiring that you know the material quite well.

This covers some fairly straightforward problems -- eg, Welcome to Tranquility is filed under "T", which just needs an arbitrary but consistent answer -- but some that require at least a solid understanding of what the book *is*. For example, that Adventure Comics is mostly sorted under "S" for "Superman" -- and indeed, that there is a ten-year run when it gets interfiled with Superman, Action and Man of Steel, since they essentially comprised a single weekly comic -- but old issues might be under "L" for "Legion of Super-Heroes".

And of course, then there's the totally impossible ones unless you know the comics, like the fact that Strange Kiss, Strange Kisses and Strange Killings all get filed under "G" -- because that series of mini-series eventually resolved into the ongoing story Gravel. Or the way that Martha Washington Goes to War also gets filed under "G" -- because it is a sequel to the classic story Give Me Liberty.

Truth to tell, even I have trouble keeping track of it sometimes. Hence, one of the decisions I made *many* years ago was to begin crafting the comic book database, initially as an "authority list" (I think that's the correct librarian jargon) to record my alphabetization decisions. So the Series table has both Title (the readable title) and Sorted As (which may have precious little to do with the cover). And individual Issues have a Sorted Under field, to record cases where, eg, I only bought one issue of this comic because it was crossing over with something I actually gave a damn about. Most importantly, though, it means that when I find myself trying to remember what I decided, I can pull up the database and at least be consistent...

100+ longboxes?

Dude your protestations to the contrary, like it or not, you ARE a collector.

...or at least a comic book hoarder. :)

Edited at 2013-03-20 08:27 pm (UTC)

Kate would probably agree with the latter, but I'm mainly a reader: I actively want to get rid of *most* of that pile. (Admittedly, probably keeping a good deal, but just a fraction of the total.) While there's a modest amount of geek cred in having the most insane comic book stash, there isn't enough to be worth keeping all those.

I suspect my acquisitions will begin to taper off in the next few years, as tablets and apps get good enough that there is no longer any strong reason to buy most comics in paper form. And if I'm *really* lucky, Marvel will decide to do something as irritating as DC did, convincing me to drop the entire line -- that would help a good deal...

(Deleted comment)

As a non-trained programmer...

a sort of "hedge wizard" coder, if you will: thank you *very* much for that explanation of sorting algorithms. I knew this trick, just not its programming application.

Not only did you teach me something, I now more fully appreciate Monday's xkcd cartoon.

Edited at 2013-03-20 08:30 pm (UTC)

Re: As a non-trained programmer...

Hah! I actually hadn't gotten to that one yet. Delightful bit of coincidence. (And now I've had to go look up "bogosort" myself...)

(Deleted comment)
Wow. It's easy to say from distance and perspective, but I don't know if I could have stuck around that facility as long as you did. Driving away efficient workers seems like a great way to ruin a business. They should have had you teach the techniques at every step of the way to improve everyone's efficiency.

Edited at 2013-03-20 11:06 pm (UTC)

One other point is that O-notation only refers to the number of comparisons required (with the typical and worst cases sometimes being different polynomials). It's really easy on modern systems to end up with memory fetches dominating the runtime. Both quicksort and merge sort can maintain locality of reference, but naive implementations will probably thrash the caches.

Oh, sure, and I probably wouldn't ever write one myself for production code. This is something that was really drilled in at Memento: designs that work beautifully for ten thousand objects can be *really* miserable when you're trying to process a hundred million.

Fortunately, most modern libraries have decently good implementations of these algorithms, at least for medium-sized use. But this sort of thing is one of the numerous reasons why Querki is specifically focused on and limited to *small* datasets initially. I trust myself to be able to write a good system for managing 50k objects, but want serious DB programmers on-board before we try to tackle enterprise-scale data...

For those who find themselves enjoying the ideas of algorithms without the specifics of programming languages, you might want to check out a couple of books:

Computational Fairy Tales - http://is.gd/yWf7Av
Best Practices of Spell Design - http://is.gd/ALQEM4

Ooooh -- those look delightful! Thanks for the pointers...

I miss Outer Limits. There is no comic book shop around here (Delaware, where we relocated).

Sympathies. One of the nice thing about living in Geek Central, MA is that there are a fair number of comics stores around.

Now that I'm in Somerville, Outer Limits is getting a little inconvenient for me, truth to tell; I may eventually have to switch stores. But I'm in no rush -- it's a fine thing to have a longtime relationship with a good dealer, and I've been shopping at Steve's for something like 25 years now...

In, I think, 1992, we were over at your house and Jane had a large bunch of index cards that needed sorting. She figured that she would basically do a merge sort, maybe getting help with the individual shards, but was grumbling about the work of the merge that she'd have to do herself at the end.

I pointed out that she could do a bucket sort: Deal out all the cards into piles "starts with A", "starts with B", etc; get each pile sorted; and then at the end just stack all the piles on top of each other. She was gob-smacked, and told me "You've changed the way I will sort things for the rest of my life."

I hadn't thought of that day in years... :-)

Heh -- I had not remembered that, but it does sound very much like Jane.

(Who was in many ways an untrained programmer: she never studied the stuff, but *thought* in the right ways, and had a knack for picking up techniques that were useful. My rule of thumb was that I would usually intuit a solution quicker than her, but once I pointed it out she would always apply it much more rigorously and correctly than me ever-after. This is why I would usually win a game the first few times, but she'd usually wind up winning far more often in the long run...)

I haven't read *all* the comments, but I think I got through enough of them to still think I have something to say.

If the ultimate goal is to sell them,

a) don't do any more sorting unless you enjoy it more than anything else you could do with that time.

b) when you're ready to sell, put the comics up one box at a time, *indexing* but not necessarily sorting each box as you do so (though you can sort within a box while listing them if you want to). Include markers to make it easier to find a section within a box.

c) repeat until done.

All remaining sorting takes place on the computer; fetching requires referring to multiple indices -- but you don't have to fetch something until it's actually going to sell, so you have a reward for the work of fetching it.

Actually, the ultimate goal is to find good homes for most of them, which isn't quite the same thing as just selling. (So for example, simply giving a large number to libraries and taking the tax writeoff is an option that I'm seriously considering.)

And the indexing approach you're describing actually isn't as much of a win as you think, because I'm going to be adding a feature to Querki to make indexing much faster for coherent runs than individual issues -- basically, make it easy to say "add Batman 287-322" as a single entry. Also, the effort of *retrieving* a run in order to sell it (because let's get real, it is much harder to sell back issues if they aren't sold as coherent blocks) would be much larger, simply because of the effort of physically shifting all those boxes.

I haven't done a formal analysis of it, but I *think* that, from an overall efficiency POV, the pre-sort is still a win. Not nearly as huge a win going from four runs to one as it was when I had dozens of runs, but still a win.

Most importantly, though, I might as well. I still have to inventory everything, and separate it into the three big buckets ("definitely sell", "definitely keep" and "reconsider later"), and it's actually not much more work to do the final merge while that is happening: take all four remaining runs, and simultaneously merge them, inventory them, and re-separate them. That will have the optimal final result, and isn't much harder than handling the runs separately...

  • 1