March 1st, 2013


Flexing my programmer muscles

[This one's gonna get pretty technical; be warned. It's kinda bragging (in the "look at the size of my brain" sense), but dammit, I have spent a *lot* of time on the bloody OP Compiler, and I need to get at least a little ego-boo out of it. Programmers may actually want to give it a deep read, since it wound up as an exercise in practical data-mining.]

I've talked before about the Order of Precedence Compiler project. I'm taking the old, flat-file OP, "compiling" it into a nice normalized database, and spitting it out into MySQL format for the database system that tpau picked out. That's mostly done: I'm reading in nearly all of the files successfully, writing them out, and we're able to at least mostly run the new system with the old data.

There's one huge snag, though: the data is *magnificently* inconsistent. This isn't really about typos -- while there are a few errors and inconsistencies here and there, the vast majority of the problem is much more pedestrian, with two main causes:
  • People change their names a lot.
  • A lot of SCAdians have hard-to-spell names that they don't use consistently.

Why is this a problem? You'll recall that we have three sets of data -- the Court Reports (in the form originally recorded, more or less), the Alphabetical Listing (which generally is indexed by folks' preferred form of their name), and the List by Award (for each award, everybody who has ever received it, in order). It is not unusual for a given award to be recorded under two or even three different names in these three lists.

And the hell of it is, the system has no idea that those are the same person. This is where a hand-maintained system is very different from a program: it might be horribly obvious to a human that (picking an example at random) "Nigell Tarragon" in the alpha list is the same person as "Nigel Tarragon" in the court report, but the program doesn't know that. The names are different, period. And often it's not just one character difference: there are entire words missing, first and last names switched, or the *extremely* common case where somebody changes their name after getting their AoA.
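To make the mismatch concrete, here is a minimal Python sketch of the kind of comparison a program needs before it can even suspect that two names belong together. It is an illustration only, not the OP Compiler's actual matching code: the function names and the 0.85 threshold are made up for the example, and the scoring just uses the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Score two names from 0.0 to 1.0, tolerating swapped word order."""
    na, nb = normalize(a), normalize(b)
    direct = SequenceMatcher(None, na, nb).ratio()
    # Sorting the words first catches "first and last names switched".
    sa = " ".join(sorted(na.split()))
    sb = " ".join(sorted(nb.split()))
    reordered = SequenceMatcher(None, sa, sb).ratio()
    return max(direct, reordered)


def probable_duplicates(names, threshold=0.85):
    """Return (name, name, score) for pairs that look like one person.

    The threshold is an arbitrary assumption; a real run would need tuning
    and a human to approve each proposal.
    """
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = similarity(a, b)
            if a != b and score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


if __name__ == "__main__":
    print(probable_duplicates(["Nigell Tarragon", "Nigel Tarragon"]))
```

A plain string comparison says these two names are simply different; a similarity score at least flags the pair for a human to confirm. It does nothing for the name-change case, where the old and new names may share no letters at all.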

The result was that the new award system was staggeringly full of duplicate entries: multiple records of somebody getting an award with different names, when it should be a single record with alternate names. What to do?

My original reaction was that I was going to have to mount a massive effort: recruit dozens of people to scour the data meticulously, looking for these duplications. But that was going to take dozens or hundreds of man-hours, and would still be hugely error-prone. So about three weeks ago, I paused and decided to step up and Be a Programmer.
So that's the state of the data-cleanup. I invite those who like Data to come take a look at the current state of the synonym file -- my gigantic list of alternate personae, misspellings, and so on. It's over 2000 entries so far, nearly all of them found by the program. Many are fairly trivial -- by and large, I've been pretty conservative, only accepting proposals when I'm reasonably confident that they are correct -- but it's done a nice job of spotting completely different alternate personae that I just happen to know are right.
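For the programmers, here is roughly how a synonym list like that gets used once it exists. This is a hedged Python sketch under assumptions: the dictionary layout, the `(name, award, date)` record shape, and the sample date are all invented for illustration, and the real synonym file's format may look nothing like this.

```python
from collections import defaultdict

# Hypothetical synonym table: alternate spelling -> preferred name.
SYNONYMS = {
    "Nigel Tarragon": "Nigell Tarragon",
}


def canonical(name, synonyms=SYNONYMS):
    """Map a name to its preferred form; unknown names map to themselves."""
    return synonyms.get(name, name)


def merge_records(records, synonyms=SYNONYMS):
    """Collapse duplicate award records that differ only by name variant.

    records is an iterable of (name, award, date) tuples. The result maps
    (preferred_name, award, date) to a sorted list of the alternate names
    under which that award was recorded.
    """
    merged = defaultdict(set)
    for name, award, date in records:
        key = (canonical(name, synonyms), award, date)
        merged[key].add(name)
    return {key: sorted(names - {key[0]}) for key, names in merged.items()}


if __name__ == "__main__":
    sample = [
        ("Nigell Tarragon", "AoA", "1990-01-01"),  # made-up date
        ("Nigel Tarragon", "AoA", "1990-01-01"),
    ]
    print(merge_records(sample))
```

The point is that two records under different spellings turn into one record with an alternate name attached, which is what the new system wants.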

Tell me if you find any errors in the synonym file -- it wouldn't surprise me if there are a few, and I'd rather catch them now than later. (Although fixing the occasional problem in the new system won't be hard.) In the meantime, I'm continuing to plug away, and improve the data as much as I can in the relatively little time we have left before the new system goes live...