Oh, those wacky fencing Orders
My main project while on Christmas break down in Florida (more notes on that anon) was to work on the OP Compiler. That's steaming forward nicely -- the Alpha Lists are now all parsing, nearly all of the Award Lists are doing so, and the Court Reports are parsing back as far as Siegfried and Wanda: there's a chance I could finish the Court Reports tomorrow.

So why am I posting after midnight? Because the bloody Fencing awards were driving me crazy, as I (correctly) realized that I probably had them grouped wrong.

The thing is, the East has had a bunch of fencing awards, including a long-since-closed one named the Guardsman, and they've gone through a lot of names, since some didn't get named immediately. So I just spent about 45 minutes correlating the Court Reports, comparing them with the Award Lists, and have come to the following conclusions:

The Order of the Guardsman can be found in the OP as the "East Kingdom Fencing Order", the "East Kingdom Fencing Award", and the "East Kingdom Order of Fence", as well as the Guardsman.

On the flip side, the Order of the Golden Rapier is found as the "High Merit for Fence", the "Kingdom Order of Fence", and simply the "Order of Fence", before things settled down on the OGR.

You see why I was tearing my hair out over this?

I *think* I have it all straight, and blessedly I don't think there is any term that is used for *both* of the awards. (Yes, I could theoretically distinguish by date, since the Guardsman was closed before the OGR started, but that goes *way* against the architecture, and would be a ton of work -- the notion of two different awards with the same name is not a pretty one. If I found conflicts, I'd actually rewrite the Court Reports instead.)
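As a rough illustration of the grouping described above, here is a minimal Python sketch of a synonym table. It is not the OP Compiler's own code: the names `ALIASES`, `build_lookup` and `canonical_award` are invented for this example, and only the award names themselves come from the post. The sketch also checks the property mentioned above, that no term is used for both awards.

```python
# Hypothetical alias table. The award names are from the post; the data
# structure and every identifier here are assumptions for illustration.
ALIASES = {
    "Guardsman": [
        "Guardsman",
        "East Kingdom Fencing Order",
        "East Kingdom Fencing Award",
        "East Kingdom Order of Fence",
    ],
    "Golden Rapier": [
        "Golden Rapier",
        "High Merit for Fence",
        "Kingdom Order of Fence",
        "Order of Fence",
    ],
}


def build_lookup(aliases):
    """Invert the table to alias -> canonical name.

    Raises ValueError if the same alias is claimed by two different awards.
    """
    lookup = {}
    for canonical, names in aliases.items():
        for name in names:
            key = name.casefold()
            if key in lookup and lookup[key] != canonical:
                raise ValueError(
                    f"{name!r} is used for both {lookup[key]} and {canonical}"
                )
            lookup[key] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_award(name):
    """Return the canonical award for a name as written, or None if unknown."""
    return LOOKUP.get(name.strip().casefold())
```

Because the lookup is by exact (case-insensitive) name, "East Kingdom Order of Fence" and "Kingdom Order of Fence" stay distinct, which is exactly what keeps the two awards apart without consulting dates.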

Anyway, making progress. With any luck, before terribly long I'll be ready for folks to take the consolidated printout, and tell me about all the synonymous names you find. I suspect that the average person in the OP will turn out to have around two disconnected entries. (I only have one, but I've had the same name since long before my AoA. By contrast, Kenric has something like eight variations of his name, and Darius has at least half a dozen that don't even resemble each other...)

Yeah, those were *interesting* times, socio-politically, in the fencing community.

It was actually noticing that you'd been inducted into both the East Kingdom Fencing Order and the OGR, some years apart, that made me nervous enough to start really looking into the issue...

:-) Yes, that would have been, what, 1991 and 1994?

Your consolidated entry, hot off the presses:
  == Donal Artur of the Silver Band [registered] ==
    07/23/1989 ACL Award of Arms -- Donal Artur of the Silver Band
    10/07/1989 ACL Troubadour -- Donal Artur of the Silver Band
    04/27/1991 A   Perseus (Carolingia) -- Donal Artur of the Silver Band
    05/11/1991   L Perseus (Carolingia) -- Donal Artur of the Silver Band
    09/28/1991 ACL East Kingdom Fencing Order -- Donal Artur of the Silver Band
    10/05/1996 A L Golden Rapier -- Donal Artur of the Silver Band
The inconsistency in the Perseus is interesting -- at some point, we should figure out which one's correct...

I'm not sure what the A/C/L means. My recollection is that I was (I think) the 4th Perseus, after Peter, Danulf and... (Sebastian?). The timing was right (late April/early May), but I can't remember at which event it happened.

The letters indicate data source:

A: this datum was found in the Alpha listing by name of person
C: this datum was found in a Court Report
L: this datum was found in the Listing by Award name

Ideally, each entry would say all three. In practice, many don't for one reason or another (for instance, there are no Court Reports for Baronial courts). But inconsistencies show up like this, with different letters for different dates. One planned enhancement is to highlight these cases automatically.
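The planned enhancement could look something like the following Python sketch. It is only an illustration under stated assumptions: the tuple layout and the function name `find_date_conflicts` are made up here, and the source flags are written without the column padding used in the printout. The rows are copied from the consolidated entry above.

```python
from collections import defaultdict

# (date, source flags, award, name), copied from the consolidated entry.
# The tuple layout is an assumption for this sketch.
RECORDS = [
    ("07/23/1989", "ACL", "Award of Arms", "Donal Artur of the Silver Band"),
    ("04/27/1991", "A", "Perseus (Carolingia)", "Donal Artur of the Silver Band"),
    ("05/11/1991", "L", "Perseus (Carolingia)", "Donal Artur of the Silver Band"),
    ("09/28/1991", "ACL", "East Kingdom Fencing Order", "Donal Artur of the Silver Band"),
    ("10/05/1996", "AL", "Golden Rapier", "Donal Artur of the Silver Band"),
]


def find_date_conflicts(records):
    """Return {(name, award): {date: flags}} for awards listed under more than one date."""
    dates_by_key = defaultdict(lambda: defaultdict(set))
    for date, sources, award, name in records:
        # Merge the source letters seen for this person/award/date.
        dates_by_key[(name, award)][date].update(sources)
    return {
        key: {date: "".join(sorted(flags)) for date, flags in dates.items()}
        for key, dates in dates_by_key.items()
        if len(dates) > 1
    }


for (name, award), dates in find_date_conflicts(RECORDS).items():
    print(f"{name}: {award} has conflicting dates {dates}")
```

On this sample only the Perseus is reported, with one date backed by the Alpha listing and the other by the Award listing. The sketch says nothing about which date is correct; that still takes a human.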

So basically, your personal entry says one thing, and the list of Perseids says another. May take some work to figure out which one's correct...

Might be easy enough: I think there were others at the time; it should be with the second group awarded. The OGR should be correct; it was under Bjorn and Morgen, in the fall out at the Rod and Gun Club, and could well have been their last court.

Fencers: We ruin everything.

Nah. That's, like, effort and dedication.

I think you SHOULD do a double check by date.

Nameless orders are a pain.

The double check by date would be a horrible amount of work and kludging of the architecture; it's not worth it for one or two edge cases, especially since this particular problem doesn't really generalize. (It's a side-effect of the bizarre politics of fencing in the 80s and 90s.)

Yes, it is possible that there could be a couple of errors, but I assure you that there are many, vastly worse errors in there -- heck, I just had a completely-missing peerage pointed out to me. The goal here isn't perfect correctness: rather, I'm trying to produce a decent best-effort starting point for the new system, in which future corrections will be made.

Nobody should fool themselves into believing that the output of my program is going to be magically pristine -- rather, I am taking a data set with *thousands* of inconsistencies (maybe tens of thousands), and trying to reduce that to hundreds.

Seriously: Jane was brilliant, and it's amazing that she made it work as well as she did. But people think that the data's a lot more consistent than it really is. The compiler exposes the myriad of tiny inconsistencies that humans don't notice, but machines do. Just for one example: your consolidated entry reveals that your AoA has a typo in the date in the Awards Listing.

For amusement, I commend the log file, which I'm checking in to track the evolution of the tool:

That's several meg of output (about 95 thousand lines so far, and still growing fast), and the second half is the consolidated OP by name. There are a *staggering* number of inconsistencies revealed there, many very clearly, mostly simply due to the loosey-goosey way we treat names.

Mind, I am going to be doing a lot of double-checks. The mismatch-analysis system, to try and line up people who get called different things in different places, is going to be one of the most sophisticated parts of the whole tool. (I'm reading into edit-difference algorithms now.) And I should be able to fix most date typos like your AoA. But I have to choose my battles if it's going to get done, and the deadline is pressing...
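For readers unfamiliar with edit-difference algorithms, here is a minimal Python sketch of the best-known one, Levenshtein distance. Nothing in the post says which algorithm the tool will actually use, so treat this as one candidate; `likely_same_person` and its `max_distance` threshold are hypothetical helpers invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def likely_same_person(a, b, max_distance=2):
    """Hypothetical helper: treat two names as candidate matches when
    they differ by at most max_distance edits, ignoring case."""
    return edit_distance(a.casefold(), b.casefold()) <= max_distance
```

A check like this catches small spelling drift in a name, but it cannot help with names that don't resemble each other at all; those would need other evidence, such as matching awards and dates.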

I still don't know Scala. (I bought the Impatient book when you recommended it - but still have not had the time/will to read it. It stares balefully on my nightstand.) And, even if I did know Scala, I lack the time to fully read and digest such a complex tool.

But I still find myself a bit surprised. The end result of your parsing should contain, at a minimum, some tuple of "who what-award when".

The idea that you can't, in some post-processing step, weed out entries whose "what" violates its date constraints ("not before", "not after") seems unlikely to me. Surely that won't break the architecture?

Frankly, if your architecture doesn't permit some filtration of the pool of awards in some way, perhaps it could be reasonably extended. So the list of "awards" could also certainly be annotated by dates. No one could be awarded a Pelican before the Pelican was created, no one can be awarded an Athena's Thimble after it was closed. That makes the list of awards a bit dynamic: the list on 1/1/2045 is certainly a proper subset of the list of all awards EVER, and may differ only slightly from the list available the day before.

That entire last paragraph MIGHT be an architectural challenge that your system does not want to withstand, of course. But the former check, as a simple post-processing step, should be trivial: give me a list of all awardings of award X, sorted by date - and the human eye can catch the out of bounds dates.
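The post-processing check suggested here can be sketched in a few lines of Python. This is an illustration of the suggestion, not anything the tool is known to do: the function names and the optional `not_before` / `not_after` bounds are invented for the example, and any real bounds would have to come from researching when each award opened and closed.

```python
from datetime import datetime

# (date, award, name) rows; the two Perseus rows come from the entry above.
SAMPLE = [
    ("05/11/1991", "Perseus (Carolingia)", "Donal Artur of the Silver Band"),
    ("04/27/1991", "Perseus (Carolingia)", "Donal Artur of the Silver Band"),
    ("10/05/1996", "Golden Rapier", "Donal Artur of the Silver Band"),
]


def awardings_by_date(records, award):
    """All (date, name) pairs for one award, oldest first."""
    rows = [
        (datetime.strptime(when, "%m/%d/%Y").date(), name)
        for when, what, name in records
        if what == award
    ]
    return sorted(rows)


def out_of_bounds(rows, not_before=None, not_after=None):
    """Rows falling outside the optional bounds. The bounds are inputs a
    human must supply; nothing here knows when an award opened or closed."""
    return [
        (when, name)
        for when, name in rows
        if (not_before is not None and when < not_before)
        or (not_after is not None and when > not_after)
    ]


for when, name in awardings_by_date(SAMPLE, "Perseus (Carolingia)"):
    print(when.isoformat(), name)
```

The sorting is the easy part. The bounds are where the effort lies, since they are data that does not exist in the original files.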

The problem is, there's no such concept as "this award is only valid between dates X and Y", or worse, "this name means A on these dates and B on these others", which is actually the problem at hand. That's a whole new subsystem to add for what looks to be a one-shot edge case, that is pointless otherwise. Not so much "hard", as probably a bunch of hours' work (much of that in research into award dates) that is, in all likelihood, to no point whatsoever. And mind, this is what I'm doing in my spare time, distracting me from actually getting my business off the ground.

Again, it's all about what's worth the effort. This thing is supposed to be *done* in a few weeks, and I have vastly more work to do than is possible in that time, given that I *should* be spending 200% of my time on Querki. So you're essentially advocating a schedule slip for a problem that we don't know exists. One thing I've learned the hard way is that, when you live the startup life, you *must* learn to prioritize.

But the former check, as a simple post-processing step, should be trivial: give me a list of all awardings of award X, sorted by date - and the human eye can catch the out of bounds dates.

Otherwise known as the output of the system. Again, you're treating this as a be-all-and-end-all, which it isn't. This is simply producing the data that will be fed into the new online database -- which is likely to be much, much, much easier to examine and edit. So spending effort on such things now is just silly...

How you spend your time is up to you - it is one of life's most precious commodities.

I note that you HAVE this problem, and therefore it is not a hypothetical problem. How much elbow grease this problem merits, is again your decision to make.

It appears to me that my suggestion would make this parser a bit MORE generalizable than it is. You are trying to use context to map sometimes-ambiguous information to more restricted categories, yes? So that you can import that more canonical data into a repository tool...

Given that date is an available contextual bit of information, it seems to me that it is useful for pruning ambiguity. If in 1994 Fred gets "The Fencing Award", we know it isn't OGR. If it is in 2004, we know it is.

If the idea I proffer is not sized to fit your resources and the scale of the problem, only you can decide that. If the decision is easy (and the answer is no), that doesn't necessarily make the idea or the offer silly...

And again - adding a simple error checking pass for dates at the END of the process, instead of within the parsing tool, is probably well worth the time to do. Judging from what you told Don about his Perseus, you are going to need to do SOME sort of date comparator anyway - to detect duplicates or near duplicates.

I note that you HAVE this problem, and therefore it is not a hypothetical problem.

That really isn't clear. While there appeared to be *surface* ambiguity, it currently looks like there is no *actual* ambiguity. And I'm finished parsing all the files covering that date range (that is, everything back to before fencing was introduced), so there is no reason to believe any ambiguous records will be added.

So you are basically telling me that I have a problem I should fix, which I have no reason to believe is true. Hence, I'm reacting a bit negatively to you pushing so hard on it. You're basically playing Bad Project Manager, telling the engineer that something should be easy when you don't actually understand the problem in enough detail. (Which you really don't. You think you do, but you don't -- you're making an awful lot of unwarranted assumptions.)

It appears to me that my suggestion would make this parser a bit MORE generalizable than it is.

Perhaps -- and why is that a good thing? This is a tool that will be run exactly once, ever and for all time. Spending effort generalizing it is *exactly* the kind of gold-plating I shouldn't be doing. I need to stay focused on the actual problems at hand, not hypothetical generalizations that aren't needed. I am *not* building a general-purpose tool that will be useful across many problem domains. (That's what Querki is for.) I'm building an extremely specific tool for an extremely specific problem, and over-generalization is mostly a way to blow the deadline.

And again - adding a simple error checking pass for dates at the END of the process, instead of within the parsing tool, is probably well worth the time to do. Judging from what you told Don about his Perseus, you are going to need to do SOME sort of date comparator anyway - to detect duplicates or near duplicates.

Yes, but this is an almost entirely different problem. Discovering mismatched dates within an individual record (which I will be fixing), and the more interesting problem of finding *matching* entries under *different* names (which is worth putting some real effort into, since it's extremely common), bear no actual technical resemblance to defining valid dates for specific awards.

It may *sound* similar because it uses the same words, but on a technical level it's involving totally different data structures. And you're introducing a requirement for data that flat-out doesn't exist in the original (which is the biggest problem with your suggestion), and isn't used in the final output, for no apparent gain...

Let's step back a sec. I think you've forgotten the parameters of this project, and I'm getting too heated.

Remember, I am *not* building the final OP system -- we got that from Atlantia, who already have a fairly nice one. My remit is pretty specific: to translate the existing hand-rolled tables into a MySQL database suitable for that new system to use. This program is only intended to be run once, and then is nothing more than an historical oddity.

While I'm doing this, the East Kingdom is *frozen*: nothing new can be entered into the OP. That's an unfortunate requirement, but necessary, since so much of my project is hand-fixing the existing data files to make them parse decently. They look nice on the surface, but the HTML is *vastly* less consistent than you would guess. I've put a ton of work into the parser, but I can only make it so robust before hitting rapidly-diminishing returns. It is just plain hard for most people to edit this stuff without introducing a variety of accidental problems, so I need the data to be locked-down while I work.

That means Deadline Uber Alles -- the very highest, tippy-toppest priority is getting things Done as quickly as possible, so we can all start working again with the new, vastly more robust system.

Along the way, I'm tackling a *little* bit of low-hanging fruit: problems where a *little* programming now can make a *lot* of difference in the final result. That mainly consists of detecting inconsistencies that are most easily found while the whole database is in memory. My belief is that, for fairly modest effort, I can fix over half of the inconsistencies in the data.

But that's basically it. The sort of problem you're describing -- one that isn't mainly about programming but about manual labor (in this case, figuring out the correct valid dates for the various awards), and which is not *clearly* rife in the existing data -- simply isn't an appropriate use of time. That sort of thing is best tackled later, in the new system once it's up and running.

That's why I'm pushing back so hard. On a technical level, you're somewhat incorrect about how things work; but more importantly, from a project-management viewpoint, it is inappropriate scheduling. Keep in mind how much of my career is spent pushing back against managers who are trying to make exactly this sort of mistake; by necessity, I tend not to be gentle in such arguments. (This is why I tend to wind up teaching project management at most companies I work for nowadays.)

Does this mean that the new system will be starting with a lot of baggage of incorrect data? You bet it does -- I hope to reduce that a bit, but even that is really a self-assigned task, outside my formal remit. The only true goal here is to translate the old OP to the new one as accurately as possible. *After* that comes the much larger problem of finding and fixing 40 years' worth of bugs. But that is not, by and large, appropriate to tackle on the extremely critical path I'm standing in the middle of.

My apologies for coming on a bit too strong here. I understand what you're suggesting; this just isn't the right time to be doing it from the overall project perspective, and I'm a tad stressy about all the looming deadlines...
