Log in

No account? Create an account
Previous Entry Share Next Entry
Flexing my programmer muscles
[This one's gonna get pretty technical; be warned. It's kinda bragging (in the "look at the size of my brain" sense), but dammit, I have spent a *lot* of time on the bloody OP Compiler, and I need to get at least a little ego-boo out of it. Programmers may actually want to give it a deep read, since it wound up as an exercise in practical data-mining.]

I've talked before about the Order of Precedence Compiler project. I'm taking the old, flat-file OP, "compiling" it into a nice normalized database, and spitting it out into MySQL format for the database system that tpau picked out. That's mostly done: I'm reading in nearly all of the files successfully, writing them out, and we're able to at least mostly run the new system with the old data.

There's one huge snag, though: the data is *magnificently* inconsistent. This isn't really about typos -- while there are a few errors and inconsistencies here and there, the vast majority of the problem is much more pedestrian, with two main causes:
  • People change their names a lot.

  • A lot of SCAdians have hard-to-spell names, that they don't use consistently.
Why is this a problem? You'll recall that we have three sets of data -- the Court Reports (in the form originally recorded, more or less), the Alphabetical Listing (which generally is indexed by folks' preferred form of their name), and the List by Award (for each award, everybody who has ever received it, in order). It is not unusual for a given award to be recorded under two or even three different names in these three lists.

And the hell of it is, the system has no idea that those are the same person. This is where a hand-maintained system is very different from a program: it might be horribly obvious to a human that (picking an example at random) "Nigell Tarragon" in the alpha list is the same person as "Nigel Tarragon" in the court report, but the program doesn't know that. The names are different, period. And often it's not just one character difference: there are entire words missing, first and last names switched, or the *extremely* common case where somebody changes their name after getting their AoA.

The result was that the new award system was staggeringly full of duplicate entries: multiple records of somebody getting an award with different names, when it should be a single record with alternate names. What to do?

My original reaction was that I was going to have to mount a massive effort, recruit dozens of people to scour the data meticulously and look for these duplications. But that was going to take dozens or hundreds of man-hours, and would still be hugely error-prone. So about three weeks ago, I paused and decided to step up and Be a Programmer.

The resulting algorithm is a Thing 'O Beauty, and shows that I did pick something up from those years of working at Memento. Indeed, while this was kind of a fun Scala programming project, the *right* tool for the job would have been Memento's Pipeline, which is a serious data-mining system perfectly suited for the problem. But instead it became a fun exercise in functional programming, as follows.

  • Start by taking all the "recognitions" -- one or more files saying that Person P got Award A on Date D. (This comes to 21,296 distinct records.)

  • Remove all of the recognitions that are "complete" -- that is, we already have matching records in the Alpha, Chrono and List files. (This originally left 18,514 that were incomplete.)
    Tangent: at this point, I had to get smarter about the definition of "incomplete". If an award is out-Kingdom, we don't expect to find it in an Eastern Court Report. If the award is Baronial, we don't expect to find it in a Court Report. If there isn't a Listing file for this particular Award, then we obviously aren't going to find it in the Listings. Applying these rules knocked the total down to a somewhat-better but still intimidating 16,244 incomplete records.

  • Sort them all by Date.
    Tangent: "date" is, sadly, less concrete than you might think. Especially in the older data, we often have only partial data -- records whose date is "05/??/1992", or even "??/??/1980". I wound up having to build my own data structure for dates, to reflect this properly, and introduce the concept of "match", which is a sort of weak "equals". Two dates match if they can be *interpreted* as the same date, given the ambiguity. For purposes of this algorithm, we sort ambiguous dates towards the front.

  • Now the fun begins. For each recognition, in order, gather up all of the ones after it that have a matching date, and are for the same award. (Possibly by a different name -- many awards have variant names as well.) Take only the ones whose "holes" line up -- for instance, if the head record is an Alpha that is missing a Court report, choose the ones that have a Court Report but no Alpha. The result is a CandidatePair, with a Target (the original head record) and a Candidate that might be the same person with a different name.

  • Okay, now you have a list of CandidatePairs. Group them by Target. (Specifically, where the CandidatePair had a *persona*, we are now grouping by *person* -- a collection of Personae.) So now you have MergeOptions -- for each Target Person, the MergeOptions is the list of all CandidatePairs that we have found for that person; each CandidatePair is essentially a proposed alternate name for that Person.

  • Now sort all of the MergeOptions by the strength of its *best* match. "Strength" is the key to the whole game. It is a composite of the number of matches, plus the edit distance of the names.
    Tangent: why number of matches? Think about the data. If I just have, say, a Court Report saying that Flubulga the Unwashed got an AoA on the same day that the Alpha listing says Heinrich Eiskopf got one, that's very weak data -- many people get awards on the same day, so it could easily be coincidence. But if we have a record from the following year that Flubulga got a Silver Crescent on the same day that the Alpha list says Heinrich got one, that becomes an *incredibly* strong indicator that Flubulga changed his name to Heinrich sometime afterwards. I haven't run any real statistics here, but I'd guess that a single data point gives a confidence of maybe 15%, but two gives you better than 98%. I don't think there were any exceptions.

    Tangent: "edit distance" is one of those tools that is crucial to data-mining, but not well enough known. Basically, it is the algorithmic definition of "these strings look alike". There are a lot of algorithms -- I wound up using Levenshtein Distance, which is one of the older ones but I found a four-line open-source Scala implementation of it. In practical terms, if the names of the target and candidate have an edit distance of less than about ten, it is highly likely (> 95%) that the Court Report just has a misspelling of the name.

  • Okay, now we have all of the MergeOptions, sorted by strength. Now write out a new names.conf file, with the best matches embedded. Names.conf is the point of the exercise: it is my "synonyms" file, saying that Person P has alternate names A, B and C. For each match, write out a synonym entry, with a comment describing the evidence why the program thinks these are a good match.

  • Go through names.conf by hand, taking a quick look at each proposal. If it looks likely, leave it in place; if not, delete it. (In later go-rounds, I added the ability to put antonyms in names.conf -- statements that Person P is *not* known as X. This lets me filter those false-positives out in later runs.)

  • Repeat several times. Each time, we get fewer proposals. Then widen the net slightly, allowing in slightly weaker matches. Repeat again and again and again.

So far, I've knocked it down to 11,763 incomplete records -- between all of this, I've managed to fix up something like 5000 mismatched records that would have been gumming up the database. I wound up spending two full days coding the damned thing (and I'm still tweaking and tuning), but I'm quite happy with where it's coming out.

So that's the state of the data-cleanup. I invite those who like Data to come take a look at the current state of the synonym file -- my gigantic list of alternate personae, misspellings, and so on. It's over 2000 entries so far, nearly all of them found by the program. Many are fairly trivial -- by and large, I've been pretty conservative, only accepting proposals when I'm reasonably confident that they are correct -- but it's done a nice job of spotting completely different alternate personae that I just happen to know are right.

Tell me if you find any errors in the synonym file -- it wouldn't surprise me if there are a few, and I'd rather catch them now than later. (Although fixing the occasional problem in the new system won't be hard.) In the meantime, I'm continuing to plug away, and improve the data as much as I can in the relatively little time we have left before the new system goes live...

  • 1
My wife and I got our AoA's, Silver Crescents, Keystones and Pelicans on the same days. :) Can you tell we work a lot together? :)
-- Dagonell

Heh -- yeah, there are a few cases of that. And it's one of the reasons I have to hand-edit the results: one of the more common errors is the system getting husbands and wives mixed up with each other. (Which I believe indicates a bug in the edit-distance system, but that's a separate matter.)

You're an impressively extreme case, though -- not sure I know anybody else with *four* parallel awards like that...

Brace yourself. I just checked the OP. It's actually six out of seven. Pelican, Keystone, Sycamore, Silver Crescent, AoA and VOID. The only award we didn't get simultaneously was the Millrind.

Wow. That has to be some sort of record...

If so, it squeaks. Checking the AE OP, I found Cygnus and Dorinda have 5 together, however, 2 of those are Duke/Duchess, Count/Countess. This seems like something your system should be able to check. See what you can find in the East OP.

Should be much easier in the new system, which is a proper MySQL database instead of flat files...

Have you thought much about the next steps?

Meaning, assuming the OP Database is perfect, we would then want to use it for recommendations, allowing people to choose from the accurate, properly spelled name. And let them know what the person already has so they don't recommend them for an award they can't have.

Or building a polling engine that lets people who are members of an order in the OP Database access it. So immediately upon getting an award you have the systems access that should be associated with it.

Of course that has the prerequisite of creating accounts that map to personas in the OP Database. So Omega has a login id and password that does not change even when his name changes, but his OP, of course, stays the same through it all...

Well, keep in mind that we aren't building the new database from scratch. We're basically starting with a (rather good) system that was developed for Atlantia some years ago, stripping off the Kingdom-specific stuff, and getting it working here.

So yes, those are all good ideas, and I'd encourage them to be gradually added. And the key concepts do exist -- in particular, the system *does* have a concept of account/person matching. In fact, I believe it has a Polling Order Forums mechanism of some sort built in, although that has enough problems with porting that we're simply disabling it for the time being.

(It's also designed to run the Heralds, the Scribes, and the University: it is *quite* a substantial piece of code.)

But for now, we're just worrying about getting the OP part up and running. In the medium term, once things are stable, I may encourage the creation of a team to take this big chunk of open-source PHP, regularize and modularize it, and really make it sing for our needs...

Cool. Still seems like a lot of refining to go, though.

There is, yes, although at some point I'm going to have to declare diminishing returns. Of course, the new system should be *vastly* better for making individual ad-hoc fixes, so the manual changes are best done after the port anyway. I'm mostly focused on doing as much data-mining as I can while it's all in RAM, within the limited schedule available...

I got my first kingdom-level award (that I recall, anyway; possibly I should check the alphabetical listing to see if any of my real persona names ever showed up) nearly 30 years into my presence in the SCA, and one of the reports gives the wrong spelling of my shire?

Also, it looks like my wife's listing got mixed up with someone else; I'll check with her when we get a moment.

I got my first kingdom-level award (that I recall, anyway; possibly I should check the alphabetical listing to see if any of my real persona names ever showed up) nearly 30 years into my presence in the SCA, and one of the reports gives the wrong spelling of my shire?

Yeah, that sort of thing is very common, and is why this has been so challenging. In particular, the data in the Court Reports is *extremely* dirty, because they were typically written by heralds working from poor information.

Also, it looks like my wife's listing got mixed up with someone else; I'll check with her when we get a moment.

Cool; thanks. Whether it's more appropriate to fix now or later depends on the nature of the problem...

Um, yeah. I was worried because it looked like someone else's identity was mixed into the report, and so the other person might be stranded (data-wise) also. The date is right (4 Feb 2012), and Kristin's AoA was awarded under the name of "Melusine of Quintavia."

... Kristin is Melusine of Quintavia? I'm amused - I found that particular data bug last week, and fixed it this morning...

I think it's going to end up being "Merlusine" in the end, because she couldn't find any record of "Melusine" other than the mythological one. The more interesting heraldic challenge is trying to keep her arms from looking too much like the Starbucks logo.

I don't have that problem, because I've kept the "Sojourner" nickname through my various persona changes, and it's what Katherine used on my AoA scroll.

I have nothing like enough programming chops to understand this on a technical level, but as a data-geek with librarian and herald sub-fields I read it all the way through and found it really cool.

  • 1