Justin du Coeur (jducoeur) wrote,
Justin du Coeur

For this, I have a computer science degree

[This weird ramble is kind of about programming, but this time it's introductory-level and generally useful, instead of the wizard-level stuff I usually talk about. We are going to teach a few basic programming concepts using my comic book collection as a motivating real-world example.]

There -- that project's been underway for about three years, and at least it's in decent shape before things have to go into storage.

I occasionally refer to Steve, the proprietor of The Outer Limits in Waltham, as my Comic Book Pusher. Some people think I'm kidding, but it's more a matter than I decided many years ago that comics are a less dangerous (if not necessarily cheaper) addiction than cocaine.

But the thing is, I am very much *not* a comic book "collector". I don't buy for value, or even completeness -- I just like reading them. So I tear through huge numbers of comics, and then set them aside. And once in a while, it occurs to me to sort them as I put them into boxes and stick them in the basement, so that later I might be able to re-read the best ones.

The problem is, while sorting a *few* comics (say, a few hundred) is easy, sorting many thousands of them is not. Hence, the last time I did a full sort of the entire pile was rather a while ago. "A while ago" being defined, in this case, as 1990. This is a Problem.

This is where the computer science degree comes in. The most important thing you learn in programming classes isn't how to program -- that becomes obsolete rather quickly as systems and languages change -- but rather, how to analyze algorithms. And the very first thing you learn is how fast the various sort algorithms run.

Most people, when given a bunch of things to sort, use some kind of Insertion Sort -- that is, you just put things in order, sticking them into place as you get to them. This works great for sorting anywhere up to about 50 comics, but once you get past about a handspan in width it slows down dramatically, because it starts taking a long time to find the right place to put each one. And in fact, we are taught in school that an insertion sort is "n squared" -- the amount of time it takes is proportional to the number of things to sort, squared. When you're sorting tens of thousands of comic books, that is a very long time indeed. (Sorting ten thousand comics this way takes literally a million times as long as sorting ten.)

The canonical fastest sorting algorithm is known, quite reasonably, as the Quick Sort. Conceptually, it's pretty straightforward. You take your pile of things, and create two buckets around a midpoint: in our case, the buckets would be "comics from A-M" and "comics from N-Z". Then repeat this for each bucket, so you'd wind up with four buckets in order: A-F, G-M, N-S, T-Z. Keep repeating until each bucket has only one comic book and *poof* -- it's all sorted! This works *great* in the computer -- indeed, in some programming languages it's basically a one-liner -- and it is "n log n": the amount of time it takes to sort everything is proportional to the number of items times the logarithm of that number (which is relatively small). This is more or less as fast as you can theoretically go. It has only one problem: it requires n buckets, and I do not have tens of thousands of tiny comic book boxes.

So for real-world problems, we have the Merge Sort, which isn't *quite* as fast as Quick Sort, but is still basically "n log n". For a merge sort, you start out by doing an insertion sort (just putting things in order) for as many things as you can easily do (in our case, about a handspan's worth of comics). Set those aside, and do it again for the next batch; repeat until everything is in little piles. Now merge together a reasonable number of piles -- take around 3-6 piles of comics, and just go from front to back, putting them together, which is extremely quick and gets you a *bigger* pile. Do that for all the small piles, so you now have some big piles. Now do the same thing for the big piles, so you wind up with one really big, completely sorted pile.

(Of course, none of this is a deep dark secret -- good librarians know this technique just as well as good programmers. But it occurred to me that many folks probably don't have cause to sort thousands of items very often, and might not know the trick.)

So when I've read 3-6 months' of comics and put them away, I typically do about two passes of this, so that I wind up with 1-3 longboxes, sorted and merged in order. What I *haven't* done since 1990 is continuing the process: take these now rather large piles, and keep merging them.

But everything is going into storage shortly, which means that all hope of *ever* seeing the collection fully sorted is Doomed Doomed Doomed if I don't make progress. So a few months ago I kicked back into motion the project that I had started before Jane died, to fully merge the whole thing. Sadly, I didn't get it all the way complete, and I need to stop now and focus on packing. But I've gotten to the point where I now have three *humongous* piles of longboxes, labeled runs A, B and C, which represent all the comics since 1990. After the move is done, I can begin to pull those out of storage, along with the pre-1990 run, merge the whole thing, and craft a really serious Querki app to inventory and sell most of it. (My comics are going to be one of the acid tests for Querki, and will help drive several generally interesting features.)

And the final result? I have 68 longboxes in the post-1990 run, along with 30-40 in the pre-1990 one, already in storage. In total, I'd guess that that's about 30,000 comics, enough that the collection *must* be kept directly on the slab, lest it break the house. A fair addiction, indeed...
Tags: comics, programming

  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded