Please ignore the sound of that exploding head...
So as I've mentioned before, my current programming project is the OP Compiler: taking the existing Order of Precedence and taming it with code, so that it can get fed into a nice, neat, vastly easier-to-maintain database going forward. I figured it would be a meaty but reasonably straightforward project -- after all, Caitlin was inhumanly good with data, and so the old HTML files should be at least *reasonably* consistent, right?

I am beginning to realize that my assumptions were incorrect. Caitlin *was* fabulous with data, and the flat files are perhaps more consistent than any other person could possibly have managed. But even she was human, and she was dealing with data from a zillion sources, with nothing automated checking the details.

So now I'm up to the point where I am successfully "compiling" a fair number of chronological court-report files (around the past eight years' worth), and all of "A" in the alphas, and I'm finding just how impossible the job had been. Everything *looks* great, and I don't think one person in 100 would catch more than a tiny number of errors. But besides the structural irregularities that I've been pulling my hair out over (mind, those huge files are completely hand-edited HTML, and the format isn't even remotely as consistent as it looks on the screen), it turns out that there are tons of *tiny* data bugs.

Let's just take the King, for example. Now that I'm actually able to print out what the compiler thinks is going on, I find that "Kenric Burn of Northampton" has a Valiant Tyger; "Kenric of Warwick" was a Rattan Champion, has a King's Cypher, and was named Crown Prince; and "Kenrick of Warwick" was Queen's Champion a couple of times, and got the Shield of Chivalry a couple of times. And when I get to parsing the K's, I'm going to have to rewrite his entry so that it cross-references all of these properly.
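For the curious, here is a minimal sketch of one way to surface near-duplicate names like these automatically. It is not what the compiler actually does; `NameVariants`, `editDistance`, `candidatePairs` and the distance threshold are all made up for illustration. It simply flags pairs of names that sit within a small edit distance of each other, so a human can decide whether they are the same person.

```scala
object NameVariants {
  /** Classic Levenshtein edit distance, computed one row at a time. */
  def editDistance(a: String, b: String): Int = {
    // Row 0: the cost of building each prefix of b from the empty string.
    val start = (0 to b.length).toVector
    a.foldLeft(start) { (prev, ca) =>
      b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (row, (cb, j)) =>
        val cost = if (ca == cb) 0 else 1
        // Cheapest of: insertion, deletion, substitution (or match).
        row :+ math.min(math.min(row.last + 1, prev(j + 1) + 1), prev(j) + cost)
      }
    }.last
  }

  /** Pairs of distinct names within maxDistance edits, ignoring case.
    * The default threshold is an arbitrary assumption, not a tuned value. */
  def candidatePairs(names: Seq[String], maxDistance: Int = 2): Seq[(String, String)] = {
    val distinct = names.distinct
    for {
      (a, i) <- distinct.zipWithIndex
      b <- distinct.drop(i + 1)
      if editDistance(a.toLowerCase, b.toLowerCase) <= maxDistance
    } yield (a, b)
  }
}
```

Something like this would pair up "Kenric of Warwick" with "Kenrick of Warwick", but it would not connect either of them to "Kenric Burn of Northampton"; that kind of link still needs an explicit cross-reference.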

Mind, none of this is to fault Caitlin -- by and large, she was typing in what she was given by the heralds, and I'm 99% certain that she screened out 90% of the errors that were handed to her. But it is all demonstrating that the job of Shepherd's Crook really *is* impossible to do by hand, and it's miraculous that she managed to make it work as well as she did for as long as she did.

Anyway, the end result of all of this is going to be quite a substantial chunk of code. I am likely to open-source it, more as a way for me to kick the tires of GitHub than because I expect it to ever be used a second time (yes, I'm spending a solid two months writing a program that will, in the end, be run exactly once). But if anyone wants to see a medium-sized body of decently structured and not *excessively* cryptic Scala code, just pipe up and I'll be happy to point you at it and discuss what's going on in it...

the job of Shepherd's Crook really *is* impossible to do by hand

What was done in the old days? Are there more courts now? More awards? (I'd imagine that the Society grew a bunch since then, so there probably are in fact more award recipients).

Not really -- Caitlin was Shepherd's Crook from something like 1995-2010, so pretty much covered the Society's "peak" period.

But keep in mind, the data is cumulative -- 40 years' worth of it -- and the big problem is inconsistency. The Kenric example noted above is a fine one: while he's always been "Kenrick", the details of his name shifted a number of times before settling down; moreover, he is still known (in fine period fashion) by a number of variants. Even with a steel-trap mind like Caitlin's, it's impossible to keep all of those variations in your head, and make sure it's all consistent on paper.

Hence the current OP Database project. Once that's up and running, it should become *vastly* more straightforward to avoid most of these inconsistencies, since you'll start a new entry by *finding* Kenric's entry and adding a new record instead of just typing. But getting there is proving to be a heck of a job...

Among the problems I'm sure you've encountered:

- precedence attaches to the person, not the persona
- people can and do change not only their personas, but also the preferred spelling of it...
- and their legal names can change, too

Actually, the second of those is the only really important one. This system is deliberately dumb about precedence per se -- I don't actually *care* about someone's resulting rank. (The eventual DB will, of course, but that's already written and mostly correct.) And legal names don't exist in my source data. (Caitlin tracked some of that in private files, but those are old and obsolete, so I'm not worrying about them.)

The one big problem is that the persona cross-referencing exists *solely* in the way the names are listed in the Alpha list -- eg, "Roger Dunbarton was: James of the Creeping Rodents". That *is* critical information, and is unfortunately wildly inconsistent in how it is presented. Parsing it has been heaps of fun...
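To give a flavor of the simplest case, here is a rough sketch of pulling the names out of a heading in exactly the "X was: Y" shape quoted above. The `AlphaHeading` object, the `Heading` class and the regex are hypothetical, and the real files vary far more than this pattern allows for; treating a chain of several "was:" clauses as a list of former names is also just an assumption.

```scala
object AlphaHeading {
  /** The name an entry is listed under, plus any former names found after it. */
  final case class Heading(current: String, former: List[String])

  // Assumed separator: the word "was", an optional colon, any letter case.
  private val Was = """(?i)\s+was\s*:?\s+""".r

  /** Splits a heading line on the separator; None if the line is blank. */
  def parse(line: String): Option[Heading] =
    Was.split(line.trim).map(_.trim).filter(_.nonEmpty).toList match {
      case current :: former => Some(Heading(current, former))
      case Nil               => None
    }
}
```

A name that happens to contain the word "was" would be split incorrectly, which is one more reason this only counts as a starting point.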

(Deleted comment)
More or less -- the Recognition structure is really the heart of things, and records the Persona that was recognized (which insofar as possible rolls up to a Person object), and the AwardAs -- that is, the *term* that was used to give the award. The latter rolls up to Award, but the variant can provide, eg, gender cues. (Pretty much the only way I have to guess at a persona's gender, which is a field in the DB we inherited.) And of course, if I can map the award to a particular Court, that mapping is also preserved, along with whether the award was also found in the Alpha list and the list by Award type.

So yes: I'm normalizing three parallel structures, and trying to lose zero data as I do so. Kind of fun from a data-modeling POV. (Document per se is mostly implicit from the rest of the data...)
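In case it helps to see the shape of that, here is a hedged sketch of how such a structure might look as Scala case classes. The type names come from the description above, but every field name, the `OpModel` wrapper and the `foundIn` helper are guesses for illustration, not the real code.

```scala
object OpModel {
  // Field names throughout are assumptions; only the type names are from the description.
  final case class Person(id: Int)
  final case class Persona(name: String, person: Option[Person]) // rolls up to a Person when known
  final case class Award(name: String)
  final case class AwardAs(term: String, award: Award)           // the term actually used in the source
  final case class Court(label: String)                          // hypothetical: however a Court is identified

  final case class Recognition(
    persona: Persona,
    awardAs: AwardAs,
    court: Option[Court],  // present only if the award could be mapped to a Court
    inAlpha: Boolean,      // also found in the Alpha list
    inAwardList: Boolean   // also found in the list by Award type
  )

  /** Names of the parallel sources in which a Recognition was seen. */
  def foundIn(r: Recognition): List[String] =
    List(
      r.court.map(_ => "court report"),
      if (r.inAlpha) Some("alpha list") else None,
      if (r.inAwardList) Some("award list") else None
    ).flatten
}
```

Keeping the three "where was this seen" facts on the record itself is what lets the normalization avoid throwing data away: a Recognition that shows up in only one or two of the sources is still representable.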

(Deleted comment)
Well, keep in mind that the "source documents" are relatively artificial -- they're already secondary sources, and don't have any direct cites to the Court Reports that are the actual sources. (And unfortunately, for immigrants, I don't know of any direct citations to where the data came from.)

And these sources are sufficiently organized that it's not worth writing them down explicitly. Source 1 is a series of files, generally one per reign, of Court Reports; Source 2 is the Alphabetical listing of citizens and their awards; and Source 3 is the chronological Listing, for each award, of the people who received it. So at least in theory, going from my synthetic Recognition record to the sources it was found in is completely mechanical.

That might change if I wind up doing fancy heuristics to guess mismatches -- for example, I'm thinking of adding an algorithm that looks for Alpha listings with no matching Court report, and tries to match them up with Court report listings with no Alpha. (Which will usually indicate a changed name that didn't get properly flagged.)
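Roughly, that heuristic might look like the sketch below. It assumes each listing can be reduced to a name, an award and a date that are directly comparable between the two sources, which is a big assumption given how inconsistent the files are; `Listing` and `candidateRenames` are made-up names.

```scala
object MismatchHeuristic {
  /** One award listing, reduced to comparable fields (an assumed simplification). */
  final case class Listing(name: String, award: String, date: String)

  /** Pairs an Alpha-only listing with a Court-only listing when the award and
    * date agree, so that only the name differs. Each pair is a candidate for a
    * name change that never got flagged, to be reviewed by a human. */
  def candidateRenames(alpha: Seq[Listing], court: Seq[Listing]): Seq[(Listing, Listing)] = {
    val alphaSet = alpha.toSet
    val courtSet = court.toSet
    val alphaOnly = alpha.filterNot(l => courtSet.contains(l))
    val courtOnly = court.filterNot(l => alphaSet.contains(l))
    for {
      a <- alphaOnly
      c <- courtOnly
      if a.award == c.award && a.date == c.date
    } yield (a, c)
  }
}
```

If several people received the same award at the same court, this produces every combination of the unmatched ones, so the output is a list of suggestions rather than something to apply blindly.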

But in practice, my plan is to simply fix the source data in cases where I'm reasonably sure of it, and leave the inconsistency if I'm not. One of the larger steps in this process is going to be laboriously going through the source files (once they're frozen, which hasn't happened yet) and my error lists, and producing a final cleaned-up master OP in the old format...

You've probably all read this already, but just in case you haven't: Falsehoods Programmers Believe About Names

I remember the article well from the nymwars. And even if I didn't already know it, the current project would drive it deeply home.

The underlying concept in the OP, which I am picking up in my software, is that there is a central Person concept, and each Person may have any number of Personae. One of those Personae is slightly privileged as the "current best" one -- the more-or-less official name that we are going to list this person under. But all of the other names are being tracked to the best of my ability, and most will wind up in the final OP as alternate names.

(The exceptions will be cases where my best guess is that one version is a misspelling or typo, in which case I'll just normalize it. Given the range of hard-to-spell names, that's not at all unusual, especially in Court Reports, where the spelling has usually been subject to a long game of telephone.)
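As a very rough sketch of the Person-with-many-names idea (the names `PersonModel`, `current`, `alternates` and `withName` are invented here, and the real structure is certainly richer), it boils down to something like this:

```scala
object PersonModel {
  /** A person listed under one "current best" name, plus any alternates. */
  final case class Person(current: String, alternates: Set[String] = Set.empty) {
    /** Every name this person is known by. */
    def allNames: Set[String] = alternates + current

    /** Records another name, optionally promoting it to the current best one.
      * The old current name is kept as an alternate rather than lost. */
    def withName(name: String, makeCurrent: Boolean = false): Person =
      if (name == current) this
      else if (makeCurrent) Person(name, (alternates + current) - name)
      else copy(alternates = alternates + name)
  }
}
```

A name judged to be a plain typo would simply never be passed to `withName`; it would be rewritten to the correct spelling before it got this far.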

Note that the output OP does *not* have a clear concept of "persona", just alternate names. This is slightly unfortunate, mostly in the handling of gender -- gender is associated with the Person (which, as a Carolingian, I am all too aware of the limitations of), as well as with individual awards. I'm going to do what I can to populate this based on the data I have, but am going to err on the side of leaving it blank if it's at all ambiguous, which is probably most of the time.

My name has had three variations, and more misspellings than I care to count. So yah... I hear you on this one.

I can attest personally to the fact that Caitlin made corrections on the fly when someone pointed them out. Otherwise it probably would have been *much* worse. There was more than one instance where I sent her corrections and she cleaned up the offending entry post haste. I don't think nearly enough people realized how much of a labor of love that OP was.

Fascinating. I can tell you that working with student and faculty data, this is not an isolated problem. I'm constantly receiving communications and requests where names are misspelled, nicknames are used that aren't in official sources, unrecorded name changes have happened, or there are subtle name conflicts where multiple people share identical or near-identical names. And with a somewhat fragmentary account system, this can be made worse when we sometimes end up with duplicate accounts....

It's not just data about humans either; place names have a similar property. Working on a local search product has driven home how much variation there is in how humans both list and search for place names...

For the record, I think that we spent about 6 months writing software that ended up being used exactly once at MetaCarta. Specifically, it was software for taking a giant software repo, pulling out all the bits that we either never should have had in the repo in the first place, or that we weren't intending to give over to the New Owners of the code, and producing an updated artifact with reasonably full history (including matching revision numbers so that commit comments didn't become wrong, iirc.)

In the end, all of the existing software was insufficient, and the end result was far too specific to be general purpose usable (though a lot of the bits got pushed back up into various related tools that we ended up using or repurposing).

Funnily enough, we did set up a script using off-the-shelf pieces originally -- which was going to take too long to run. "How long is too long" in this case was "Well, it took more than the 4 months we had before we had to shut down the server room it was running in and move it across town to our new location." (The final code ran in a couple days, I think, which was good, because it turned out we had to iterate on the final result several times; this wasn't a case of premature optimization, it was just 'off the shelf software won't solve this problem even if it runs for months'.)
