I've talked before about the Order of Precedence Compiler project. I'm taking the old, flat-file OP, "compiling" it into a nice normalized database, and spitting it out into MySQL format for the database system that tpau picked out. That's mostly done: I'm reading in nearly all of the files successfully, writing them out, and we're able to at least mostly run the new system with the old data.
There's one huge snag, though: the data is *magnificently* inconsistent. This isn't really about typos -- while there are a few errors and inconsistencies here and there, the vast majority of the problem is much more pedestrian, with two main causes:
- People change their names a lot.
- A lot of SCAdians have hard-to-spell names, that they don't use consistently.
And the hell of it is, the system has no idea that those are the same person. This is where a hand-maintained system is very different from a program: it might be horribly obvious to a human that (picking an example at random) "Nigell Tarragon" in the alpha list is the same person as "Nigel Tarragon" in the court report, but the program doesn't know that. The names are different, period. And often it's not just one character difference: there are entire words missing, first and last names switched, or the *extremely* common case where somebody changes their name after getting their AoA.
The result was that the new award system was staggeringly full of duplicate entries: multiple records of somebody getting an award with different names, when it should be a single record with alternate names. What to do?
My original reaction was that I was going to have to mount a massive effort, recruit dozens of people to scour the data meticulously and look for these duplications. But that was going to take dozens or hundreds of man-hours, and would still be hugely error-prone. So about three weeks ago, I paused and decided to step up and Be a Programmer.
The resulting algorithm is a Thing 'O Beauty, and shows that I did pick something up from those years of working at Memento. Indeed, while this was kind of a fun Scala programming project, the *right* tool for the job would have been Memento's Pipeline, which is a serious data-mining system perfectly suited for the problem. But instead it became a fun exercise in functional programming, as follows.
- Start by taking all the "recognitions" -- one or more files saying that Person P got Award A on Date D. (This comes to 21,296 distinct records.)
- Remove all of the recognitions that are "complete" -- that is, we already have matching records in the Alpha, Chrono and List files. (This originally left 18,514 that were incomplete.)
Tangent: at this point, I had to get smarter about the definition of "incomplete". If an award is out-Kingdom, we don't expect to find it in an Eastern Court Report. If the award is Baronial, we don't expect to find it in a Court Report. If there isn't a Listing file for this particular Award, then we obviously aren't going to find it in the Listings. Applying these rules knocked the total down to a somewhat-better but still intimidating 16,244 incomplete records.
- Sort them all by Date.
Tangent: "date" is, sadly, less concrete than you might think. Especially in the older data, we often have only partial data -- records whose date is "05/??/1992", or even "??/??/1980". I wound up having to build my own data structure for dates, to reflect this properly, and introduce the concept of "match", which is a sort of weak "equals". Two dates match if they can be *interpreted* as the same date, given the ambiguity. For purposes of this algorithm, we sort ambiguous dates towards the front.
- Now the fun begins. For each recognition, in order, gather up all of the ones after it that have a matching date, and are for the same award. (Possibly by a different name -- many awards have variant names as well.) Take only the ones whose "holes" line up -- for instance, if the head record is an Alpha that is missing a Court report, choose the ones that have a Court Report but no Alpha. The result is a CandidatePair, with a Target (the original head record) and a Candidate that might be the same person with a different name.
- Okay, now you have a list of CandidatePairs. Group them by Target. (Specifically, where the CandidatePair had a *persona*, we are now grouping by *person* -- a collection of Personae.) So now you have MergeOptions -- for each Target Person, the MergeOptions is the list of all CandidatePairs that we have found for that person; each CandidatePair is essentially a proposed alternate name for that Person.
- Now sort all of the MergeOptions by the strength of its *best* match. "Strength" is the key to the whole game. It is a composite of the number of matches, plus the edit distance of the names.
Tangent: why number of matches? Think about the data. If I just have, say, a Court Report saying that Flubulga the Unwashed got an AoA on the same day that the Alpha listing says Heinrich Eiskopf got one, that's very weak data -- many people get awards on the same day, so it could easily be coincidence. But if we have a record from the following year that Flubulga got a Silver Crescent on the same day that the Alpha list says Heinrich got one, that becomes an *incredibly* strong indicator that Flubulga changed his name to Heinrich sometime afterwards. I haven't run any real statistics here, but I'd guess that a single data point gives a confidence of maybe 15%, but two gives you better than 98%. I don't think there were any exceptions.
Tangent: "edit distance" is one of those tools that is crucial to data-mining, but not well enough known. Basically, it is the algorithmic definition of "these strings look alike". There are a lot of algorithms -- I wound up using Levenshtein Distance, which is one of the older ones but I found a four-line open-source Scala implementation of it. In practical terms, if the names of the target and candidate have an edit distance of less than about ten, it is highly likely (> 95%) that the Court Report just has a misspelling of the name.
- Okay, now we have all of the MergeOptions, sorted by strength. Now write out a new names.conf file, with the best matches embedded. Names.conf is the point of the exercise: it is my "synonyms" file, saying that Person P has alternate names A, B and C. For each match, write out a synonym entry, with a comment describing the evidence why the program thinks these are a good match.
- Go through names.conf by hand, taking a quick look at each proposal. If it looks likely, leave it in place; if not, delete it. (In later go-rounds, I added the ability to put antonyms in names.conf -- statements that Person P is *not* known as X. This lets me filter those false-positives out in later runs.)
- Repeat several times. Each time, we get fewer proposals. Then widen the net slightly, allowing in slightly weaker matches. Repeat again and again and again.
So far, I've knocked it down to 11,763 incomplete records -- between all of this, I've managed to fix up something like 5000 mismatched records that would have been gumming up the database. I wound up spending two full days coding the damned thing (and I'm still tweaking and tuning), but I'm quite happy with where it's coming out.
So that's the state of the data-cleanup. I invite those who like Data to come take a look at the current state of the synonym file -- my gigantic list of alternate personae, misspellings, and so on. It's over 2000 entries so far, nearly all of them found by the program. Many are fairly trivial -- by and large, I've been pretty conservative, only accepting proposals when I'm reasonably confident that they are correct -- but it's done a nice job of spotting completely different alternate personae that I just happen to know are right.
Tell me if you find any errors in the synonym file -- it wouldn't surprise me if there are a few, and I'd rather catch them now than later. (Although fixing the occasional problem in the new system won't be hard.) In the meantime, I'm continuing to plug away, and improve the data as much as I can in the relatively little time we have left before the new system goes live...