Oh, those wacky fencing Orders
My main project while on Christmas break down in Florida (more notes on that anon) was to work on the OP Compiler. That's steaming forward nicely -- the Alpha Lists are now all parsing, nearly all of the Award Lists are doing so, and the Court Reports are parsing back as far as Siegfried and Wanda: there's a chance I could finish the Court Reports tomorrow.

So why am I posting after midnight? Because the bloody Fencing awards were driving me crazy, as I (correctly) realized that I probably had them grouped wrong.

The thing is, the East has had a bunch of fencing awards, including a long-since-closed one named the Guardsman, and they've gone through a lot of names, since some didn't get named immediately. So I just spent about 45 minutes correlating the Court Reports, comparing them with the Award Lists, and have come to the following conclusions:

The Order of the Guardsman can be found in the OP as the "East Kingdom Fencing Order", the "East Kingdom Fencing Award", and the "East Kingdom Order of Fence", as well as the Guardsman.

On the flip side, the Order of the Golden Rapier is found as the "High Merit for Fence", the "Kingdom Order of Fence", and simply the "Order of Fence", before things settled down on the OGR.

You see why I was tearing my hair out over this?

I *think* I have it all straight, and blessedly I don't think there is any term that is used for *both* of the awards. (Yes, I could theoretically distinguish by date, since the Guardsman was closed before the OGR started, but that goes *way* against the architecture, and would be a ton of work -- the notion of two different awards with the same name is not a pretty one. If I found conflicts, I'd actually rewrite the Court Reports instead.)
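As an illustration of that grouping, here is a minimal Python sketch of an alias table. The award names are the ones listed above; the table layout, `build_lookup`, and `canonical_award` are hypothetical names invented for this example, not the OP Compiler's actual code. The sketch refuses to build if any name is claimed by two awards, which is the conflict described above.

```python
# Hypothetical alias table: names come from the post, the structure is assumed.
ALIASES = {
    "Guardsman": [
        "Guardsman",
        "East Kingdom Fencing Order",
        "East Kingdom Fencing Award",
        "East Kingdom Order of Fence",
    ],
    "Golden Rapier": [
        "Golden Rapier",
        "High Merit for Fence",
        "Kingdom Order of Fence",
        "Order of Fence",
    ],
}


def build_lookup(aliases):
    """Invert the alias table, refusing any name claimed by two awards."""
    lookup = {}
    for canonical, names in aliases.items():
        for name in names:
            key = name.casefold()
            if key in lookup and lookup[key] != canonical:
                raise ValueError(
                    f"{name!r} is used for both {lookup[key]!r} and {canonical!r}"
                )
            lookup[key] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_award(name):
    """Map a name as written to its canonical award; unknown names pass through."""
    cleaned = name.strip()
    return LOOKUP.get(cleaned.casefold(), cleaned)
```

Matching here is on the whole name, case-insensitively, so "Order of Fence" and "East Kingdom Order of Fence" stay distinct even though one contains the other.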

Anyway, making progress. With any luck, before terribly long I'll be ready for folks to take the consolidated printout, and tell me about all the synonymous names you find. I suspect that the average person in the OP will turn out to have around two disconnected entries. (I only have one, but I've had the same name since long before my AoA. By contrast, Kenric has something like eight variations of his name, and Darius has at least half a dozen that don't even resemble each other...)

Yeah, those were *interesting* times, socio-politically, in the fencing community.

It was actually noticing that you'd been inducted into both the East Kingdom Fencing Order and the OGR, some years apart, that made me nervous enough to start really looking into the issue...

:-) Yes, that would have been, what, 1991 and 1994?

Your consolidated entry, hot off the presses:
  == Donal Artur of the Silver Band [registered] ==
    07/23/1989 ACL Award of Arms -- Donal Artur of the Silver Band
    10/07/1989 ACL Troubadour -- Donal Artur of the Silver Band
    04/27/1991 A   Perseus (Carolingia) -- Donal Artur of the Silver Band
    05/11/1991   L Perseus (Carolingia) -- Donal Artur of the Silver Band
    09/28/1991 ACL East Kingdom Fencing Order -- Donal Artur of the Silver Band
    10/05/1996 A L Golden Rapier -- Donal Artur of the Silver Band
The inconsistency in the Perseus is interesting -- at some point, we should figure out which one's correct...

I'm not sure what the A/C/L means. My recollection is that I was (I think) the 4th Perseus, after Peter, Danulf and... (Sebastian?). The timing was right (late April/early May) but I can't remember at which event it happened.

The letters indicate data source:

A: this datum was found in the Alpha listing by name of person
C: this datum was found in a Court Report
L: this datum was found in the Listing by Award name

Ideally, each entry would say all three. In practice, many don't for one reason or another (for instance, there are no Court Reports for Baronial courts). But inconsistencies show up like this, with different letters for different dates. One planned enhancement is to highlight these cases automatically.
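A rough sketch of how such cases might be flagged automatically, in Python. The record layout (date, source letters, award) mirrors the consolidated entry shown above; `find_date_conflicts` is a hypothetical name, and this is an assumption about one way to do it, not the planned implementation.

```python
from collections import defaultdict
from datetime import date


def find_date_conflicts(records):
    """Group one person's records by award and report awards seen under several dates.

    records: iterable of (date, sources, award), where sources is a string
    drawn from "A", "C", "L". Returns {award: [(date, sources), ...]}.
    """
    by_award = defaultdict(list)
    for when, sources, award in records:
        by_award[award].append((when, sources))
    return {
        award: sorted(rows)
        for award, rows in by_award.items()
        if len({when for when, _ in rows}) > 1
    }


# The consolidated entry quoted above, transcribed as tuples.
entry = [
    (date(1989, 7, 23), "ACL", "Award of Arms"),
    (date(1989, 10, 7), "ACL", "Troubadour"),
    (date(1991, 4, 27), "A", "Perseus (Carolingia)"),
    (date(1991, 5, 11), "L", "Perseus (Carolingia)"),
    (date(1991, 9, 28), "ACL", "East Kingdom Fencing Order"),
    (date(1996, 10, 5), "AL", "Golden Rapier"),
]

conflicts = find_date_conflicts(entry)
```

For this entry, only the Perseus is reported, with the Alpha listing and the Award listing disagreeing on the date. A check this naive would also flag anyone who legitimately received the same award twice, so its output is a list for a human to review rather than a list of errors.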

So basically, your personal entry says one thing, and the list of Perseids says another. May take some work to figure out which one's correct...

Might be easy enough: I think there were others at the time; it should be with the second group awarded. The OGR should be correct; it was under Bjorn and Morgen, in the fall out at the Rod and Gun Club, and could well have been their last court.

Fencers: We ruin everything.

Nah. That's, like, effort and dedication.

(Deleted comment)
The double check by date would be a horrible amount of work and kludging of the architecture; it's not worth it for one or two edge cases, especially since this particular problem doesn't really generalize. (It's a side-effect of the bizarre politics of fencing in the 80s and 90s.)

Yes, it is possible that there could be a couple of errors, but I assure you that there are many, vastly worse errors in there -- heck, I just had a completely-missing peerage pointed out to me. The goal here isn't perfect correctness: rather, I'm trying to produce a decent best-effort starting point for the new system, in which future corrections will be made.

Nobody should fool themselves into believing that the output of my program is going to be magically pristine -- rather, I am taking a data set with *thousands* of inconsistencies (maybe tens of thousands), and trying to reduce that to hundreds.

Seriously: Jane was brilliant, and it's amazing that she made it work as well as she did. But people think that the data's a lot more consistent than it really is. The compiler exposes the myriad of tiny inconsistencies that humans don't notice, but machines do. Just for one example: your consolidated entry reveals that your AoA has a typo in the date in the Awards Listing.

For amusement, I commend the log file, which I'm checking in to track the evolution of the tool:


That's several meg of output (about 95 thousand lines so far, and still growing fast), and the second half is the consolidated OP by name. There are a *staggering* number of inconsistencies revealed there, many very clearly, mostly simply due to the loosey-goosey way we treat names.

Mind, I am going to be doing a lot of double-checks. The mismatch-analysis system, to try and line up people who get called different things in different places, is going to be one of the most sophisticated parts of the whole tool. (I'm reading up on edit-distance algorithms now.) And I should be able to fix most date typos like your AoA. But I have to choose my battles if it's going to get done, and the deadline is pressing...
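For readers unfamiliar with the idea, here is a minimal Python sketch of the classic Levenshtein edit distance, with a crude similarity test on top. Which algorithm the tool will actually use is not stated here; the function names, the `max_ratio` threshold, and the misspelled name in the usage example are all assumptions made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def likely_same_person(name_a, name_b, max_ratio=0.2):
    """Hypothetical heuristic: names match if the distance is small relative to length."""
    a, b = name_a.casefold(), name_b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_ratio


# Invented spelling variant, for illustration only.
close = likely_same_person(
    "Donal Artur of the Silver Band", "Donal Arthur of the Silver Band"
)
```

A measure like this only catches near-miss spellings. Variants that don't resemble each other at all would still need to be linked by hand.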

(Deleted comment)
The problem is, there's no such concept as "this award is only valid between dates X and Y", or worse, "this name means A on these dates and B on these others", which is actually the problem at hand. That's a whole new subsystem to add for what looks to be a one-shot edge case, that is pointless otherwise. Not so much "hard", as probably a bunch of hours' work (much of that in research into award dates) that is, in all likelihood, to no point whatsoever. And mind, this is what I'm doing in my spare time, distracting me from actually getting my business off the ground.

Again, it's all about what's worth the effort. This thing is supposed to be *done* in a few weeks, and I have vastly more work to do than is possible in that time, given that I *should* be spending 200% of my time on Querki. So you're essentially advocating a schedule slip for a problem that we don't know exists. One thing I've learned the hard way is that, when you live the startup life, you *must* learn to prioritize.

But the former check, as a simple post-processing step, should be trivial: give me a list of all awardings of award X, sorted by date - and the human eye can catch the out of bounds dates.

Otherwise known as the output of the system. Again, you're treating this as a be-all-and-end-all, which it isn't. This is simply producing the data that will be fed into the new online database -- which is likely to be much, much, much easier to examine and edit. So spending effort on such things now is just silly...

(Deleted comment)
I note that you HAVE this problem, and therefore it is not a hypothetical problem.

That really isn't clear. While there appeared to be *surface* ambiguity, it currently looks like there is no *actual* ambiguity. And I'm finished parsing all the files covering that date range (that is, everything up to before fencing was introduced), so there is no reason to believe any ambiguous records will be added.

So you are basically telling me that I have a problem I should fix, which I have no reason to believe is true. Hence, I'm reacting a bit negatively to you pushing so hard on it. You're basically playing Bad Project Manager, telling the engineer that something should be easy when you don't actually understand the problem in enough detail. (Which you really don't. You think you do, but you don't -- you're making an awful lot of unwarranted assumptions.)

It appears to me that my suggestion would make this parser a bit MORE generalizable than it is.

Perhaps -- and why is that a good thing? This is a tool that will be run exactly once, ever and for all time. Spending effort generalizing it is *exactly* the kind of gold-plating I shouldn't be doing. I need to stay focused on the actual problems at hand, not hypothetical generalizations that aren't needed. I am *not* building a general-purpose tool that will be useful across many problem domains. (That's what Querki is for.) I'm building an extremely specific tool for an extremely specific problem, and over-generalization is mostly a way to blow the deadline.

And again - adding a simple error checking pass for dates at the END of the process, instead of within the parsing tool, is probably well worth the time to do. Judging from what you told Don about his Perseus, you are going to need to do SOME sort of date comparator anyway - to detect duplicates or near duplicates.

Yes, but this is an almost entirely different problem. Discovering mismatched dates within an individual record (which I will be fixing), and the more interesting problem of finding *matching* entries under *different* names (which is worth putting some real effort into, since it's extremely common), bear no actual technical resemblance to defining valid dates for specific awards.

It may *sound* similar because it uses the same words, but on a technical level it's involving totally different data structures. And you're introducing a requirement for data that flat-out doesn't exist in the original (which is the biggest problem with your suggestion), and isn't used in the final output, for no apparent gain...

Let's step back a sec. I think you've forgotten the parameters of this project, and I'm getting too heated.

Remember, I am *not* building the final OP system -- we got that from Atlantia, who already have a fairly nice one. My remit is pretty specific: to translate the existing hand-rolled tables into a MySQL database suitable for that new system to use. This program is only intended to be run once, and then is nothing more than an historical oddity.

While I'm doing this, the East Kingdom is *frozen*: nothing new can be entered into the OP. That's an unfortunate requirement, but necessary, since so much of my project is hand-fixing the existing data files to make them parse decently. They look nice on the surface, but the HTML is *vastly* less consistent than you would guess. I've put a ton of work into the parser, but I can only make it so robust before hitting rapidly-diminishing returns. It is just plain hard for most people to edit this stuff without introducing a variety of accidental problems, so I need the data to be locked-down while I work.

That means Deadline Uber Alles -- the very highest, tippy-toppest priority is getting things Done as quickly as possible, so we can all start working again with the new, vastly more robust system.

Along the way, I'm tackling a *little* bit of low-hanging fruit: problems where a *little* programming now can make a *lot* of difference in the final result. That mainly consists of detecting inconsistencies that are most easily found while the whole database is in memory. My belief is that, for fairly modest effort, I can fix over half of the inconsistencies in the data.

But that's basically it. The sort of problem you're describing -- one that isn't mainly about programming but about manual labor (in this case, figuring out the correct valid dates for the various awards), and that is not *clearly* rife in the existing data -- simply isn't an appropriate use of time. That sort of thing is best tackled later, in the new system once it's up and running.

That's why I'm pushing back so hard. On a technical level, you're somewhat incorrect about how things work; but more importantly, from a project-management viewpoint, it is inappropriate scheduling. Keep in mind how much of my career is spent pushing back against managers who are trying to make exactly this sort of mistake; by necessity, I tend not to be gentle in such arguments. (This is why I tend to wind up teaching project management at most companies I work for nowadays.)

Does this mean that the new system will be starting with a lot of baggage of incorrect data? You bet it does -- I hope to reduce that a bit, but even that is really a self-assigned task, outside my formal remit. The only true goal here is to translate the old OP to the new one as accurately as possible. *After* that comes the much larger problem of finding and fixing 40 years' worth of bugs. But that is not, by and large, appropriate to tackle on the extremely critical path I'm standing in the middle of.

My apologies for coming on a bit too strong here. I understand what you're suggesting; this just isn't the right time to be doing it from the overall project perspective, and I'm a tad stressy about all the looming deadlines...
