What are the key bits of "theory" that every programmer should know?
Another day, another networking event -- I'm slowly getting used to going to all these Boston Tech Meetups and such, to meet people, talk up Querki and start to understand how one gets an investment.

Along the way, I'm chatting with lots of folks, and a remarkably large fraction lead off with, "Well, I've always been doing X, but I want to learn to code". (Last night's was a fellow who does financial compliance work for one of the large funds.) These folks are usually self-taught, and tend to be very self-deprecating about the fact that they didn't go to school so they don't *really* understand programming. A couple of the programmers I was with and I got chatting about that, and the fact that, yes, the best way to learn to program is by doing. A degree in CS is helpful, but mostly in that it teaches you some of the underlying theory for programming *well*; the nuts and bolts change so often that the details you learn in school will only be useful for a limited time anyway. Somewhere in there, I asserted that you could probably list all of the most-useful bits of theory and practice in one brief talk anyway.

So, here's a challenge: help me figure out what those are. What are the key engineering principles that *every* programmer should know, that probably aren't obvious to a newbie and which aren't necessarily going to be taught in an online "How to Java" class?

I'll start out with a few offhand:

Refactoring: great code doesn't usually come from a Beautiful Crystalline Vision that some programmer dreams up -- it comes from writing some code, getting it working, and then rearranging it to make the code *better* while it's still working. That's "refactoring": the art of making the code cleaner without changing what it's doing. It's a good habit to get into, especially because it takes practice. (Granted, listing all the major refactoring techniques is a good-sized talk itself; I highly recommend Fowler's book on the subject.)

The DRY (Don't Repeat Yourself) Principle: which I usually describe as "Duplication is the source of all evil". Any time you are duplicating code, you're making it much more likely that you'll get bugs when things change. Much of refactoring is about merging things to eliminate duplication. Similarly, duplicate data is prone to getting out of sync and causing problems, so you should usually try to point to the same data when it's convenient to do so.
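To make that concrete, here's a minimal sketch in Python. The discount rule and all the function names are made up for illustration. The "before" pair repeats the same rule in two places, so a change to one can silently miss the other; the "after" pair points both callers at a single function.

```python
# Hypothetical example: the same pricing rule written twice, then merged.

# Before: the discount rule lives in two places and can drift apart.
def invoice_total_duplicated(prices):
    total = sum(prices)
    if total > 100:
        total = total * 0.9
    return total


def cart_preview_duplicated(prices):
    total = sum(prices)
    if total > 100:
        total = total * 0.9
    return f"Estimated total: {total:.2f}"


# After: one function owns the rule; both callers point at it.
def discounted_total(prices):
    """Sum the prices and apply a 10% discount to totals above 100."""
    total = sum(prices)
    if total > 100:
        total = total * 0.9
    return total


def invoice_total(prices):
    return discounted_total(prices)


def cart_preview(prices):
    return f"Estimated total: {discounted_total(prices):.2f}"
```

If the threshold or the percentage changes later, there is now exactly one line to edit.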

Efficiency is good, but algorithmic complexity is what matters: this is what's often called "Big-O" notation in computer science. How fast things run *does* matter, but only in the grand scheme of things. Whether this approach takes twice as long as that one probably doesn't matter unless you're doing it a bazillion times per second. What *does* tend to matter, given a list of size n, is whether you're going through it just once -- O(n) in the notation -- or whether each time through you're going through the whole list again -- O(n^2) in the notation, that is, "n-squared". (You'd be surprised how easy it is to wind up with algorithms that are n^2 or even n^3 -- that can actually get slow.) Or, if you have two lists of sizes m and n, does your approach take O(n+m) time, or O(n*m)? It's worth practicing thinking through these order-of-magnitude evaluations and getting an intuition for it. That said...
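A rough illustration of the two-list case, sketched in Python with made-up function names. Both functions return the items of the first list that also appear in the second. The first rescans the second list for every item, so it's O(n*m); the second builds a set once and then makes a single pass, which is roughly O(n+m), assuming the items are hashable and set lookups take constant time on average.

```python
def common_items_quadratic(xs, ys):
    """O(n*m): for each item in xs, scan ys from the start."""
    result = []
    for x in xs:
        for y in ys:
            if x == y:
                result.append(x)
                break  # found it; move on to the next x
    return result


def common_items_linear(xs, ys):
    """Roughly O(n+m): build a set from ys once, then one pass over xs."""
    seen = set(ys)
    return [x for x in xs if x in seen]
```

With ten items each you'd never notice the difference; with a hundred thousand each, the first version does on the order of ten billion comparisons in the worst case.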

Big stuff swamps small stuff: in one community the other day, I pointed out an approach to solving a problem that involved creating an extra object for each HTTP call. One of the folks in the discussion asked whether that inefficiency would matter, and I had to point out that you're already handling an HTTP call -- at *best*, the overhead of that handler is at least 1000 times that extra object creation, quite likely 10000 times more, so this is a drop in the bucket. So keep scale in mind, and don't sweat the small stuff. If you know your list is never going to have more than ten entries, even O(n^3) probably doesn't matter much.

What else? Can we craft a reasonably brief Rosetta Stone that summarizes the *common* stuff that every programmer should know, so they know what to look for? What are the principles that are true regardless of programming language, which aren't necessarily taught by the average JavaScript bootcamp? DRY is the heart and soul of good programming IMO -- are there other principles of similar importance?


Every program is an attempt to capture and automate a decision-making process. If you don't fully understand the specific process you are working on, nothing else matters.

True -- although that comes with the flip-side, that you often (almost always, in my experience) learn along the way that you *didn't* completely understand that process at the beginning, and need to adjust as you go. *Believing* that you fully understand what you're doing, and being bull-headed about that, is one of the most common failure modes...

(Deleted comment)
Generally speaking, more of people's code is built to detect or respond to error than to do the actual task

Not sure that I *quite* agree with this one -- a lot of the error-checking is pretty well encapsulated these days, so I typically find it a good deal less than half in terms of SLOC. But this might depend on your definition of "the actual task". (I often find that plumbing is the single largest component, try though I may to minimize that.)

But a spinoff that isn't at all obvious to the beginner is that, if you are building something at all serious, testing is more than half the effort. (And should generally be more than half the total code if your automated testbase is sufficient.)

The difference between a trivial project and a serious project is that on every serious project, maintenance and improvements take much, much longer than the initial development phase.

Corollary: anything "clever" or "elegant" needs more commentary, not less.

(Deleted comment)
Hadn't actually come across the SOLID acronym before, although I know its components. Some may be a tad deeper than I'm looking for here, but they're all worth thinking about -- thanks for the pointer!

(I'll have to think about what to say about concurrency beyond "Explicit Threads are your Enemy; avoid them". Beyond that may be getting deeper into the weeds than I want here, especially because concurrency is mostly irrelevant to several major languages. If you're thinking about concurrency, you're probably already at a higher level than I'm focusing on here.)

That programming is, at essence, struggling with the finite ability of the human brain to understand things. Most other principles fall out from this.

Ah, lovely point. Indeed, I think there's a corollary that I totally should point out here, which is that a *large* fraction of serious programming is all about breaking problems down into small, bite-sized pieces that are simple enough that you can be somewhat confident about how they behave, and then using those as building blocks to build larger pieces. Perfection is unlikely, but the goal is being able to understand each component.

The bane of premature or naive optimization of these.

Mmm -- very good point. That may actually be the only point worth making at this level, rather than even worrying about the complexity one.

That programs are necessarily created in service of humans. In the end, some human will need to see benefit, or you don't get paid.

Yaas. The audience I'm thinking about right now (which is mainly entrepreneurs) mostly gets that in their gut, but it's a common enough trap to be worth underlining.

I'd also point to even more fundamentals that they'll be exposed to once they start learning, but should be stressed as places to pay attention to things:

Comment/document. It is possible to go overboard here, but at least have a written sense of what each function is doing for you.

Recursion. Or more generally, strategies for smartly breaking a problem down into smaller bits. Honestly, this has more to do with understanding your logic than anything else.

Testing. Ways of looking into what is happening. Dump lots of stuff to stdout now, clean it up when everything is working (and perhaps commented).

Re recursion: Recursion itself is just a technique, of course, but yes, absolutely. I'd express this as "all programming is a method of taking a difficult problem and redefining it into a set of smaller, easier problems. Keep going until the smaller, easier problems are things the computer already knows how to solve; once you've done that, you're done."
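A tiny sketch of that in Python. The data shape is invented for the example: a "folder" is a list, and a "file" is just its size as an integer. The hard problem (size of a whole tree) gets redefined as smaller ones (size of each child) until we hit something the computer already knows how to do (return a number).

```python
def total_size(entry):
    """entry is either an int (a file's size) or a list of entries (a folder)."""
    if isinstance(entry, int):
        return entry  # small enough: nothing left to break down
    # A folder's size is just the sum of the sizes of its children.
    return sum(total_size(child) for child in entry)
```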


Start somewhere: People will talk about top down or bottom up; they'll talk about test-driven development, about comments-first development; it doesn't matter as long as you end up with everything you need eventually. The most important thing in order to end up with working code is to have code--and the most important thing there is to start. Write whatever is easiest, when you hit a stopping/changing point, go on to the next thing, whether that's tests or comments or the next piece of code. The hardest place to code from is often a blank file (this ties nicely into refactoring, since the code even after it functions might not be super-efficient, and that's ok).

Computer time is cheaper than programmer time: Efficiency can and often will matter, but for far too many things, it's cheaper -- in terms of effort, and also in money if money is involved -- to write things in ways that are easier and faster for the programmer than it is to do things that are easier/more efficient on the iron. Sure, you might need that extra bit of speed, but chances are, you won't.

You Won't Need It: That brings us to the great dictum of XP: You Won't Need It (usually phrased as "You Aren't Gonna Need It", or YAGNI). Try to favor working code over coding abstractions that you don't need yet, pretty much always. Sure, if an abstraction is easy and the natural place to go next, you can code it in advance of direct necessity. But it's much easier to take working code and refactor it to the more perfect abstraction than to build the abstraction and have it complicate and obfuscate your code long after it becomes obvious that you'll never use it.

ETA: Fail fast. In general, you don't want to code to check assumptions and cover every possible case at every point; that way lies madness. Instead, rule out the impossible and incorrect cases as early as possible in a given branch, then write the rest of the code assuming the inputs are appropriately correct. That way you're dealing with the bad cases in as few places as possible. (Exceptions and exception handling are a special case of fail fast, as using them avoids having to have lots of code everywhere checking for and passing up errors).
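As a minimal sketch of the shape this takes, in Python (the function and its 1-to-5 rating rule are hypothetical): the bad cases are rejected at the top, and everything after that gets to assume sane input.

```python
def average_rating(ratings):
    """Return the mean of a non-empty list of ratings between 1 and 5."""
    # Rule out the incorrect cases as early as possible...
    if not ratings:
        raise ValueError("ratings must not be empty")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating!r}")

    # ...so the rest of the code can assume the input is valid.
    return sum(ratings) / len(ratings)
```

The caller gets an exception naming the actual problem, rather than a ZeroDivisionError or a nonsense average somewhere further down the line.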

Edited at 2016-05-11 04:09 pm (UTC)

Computer time is cheaper than programmer time
And you should have good tools for telling you when it isn't.

The hardest place to code from is often a blank file

Heh. Yep, this is one of the ones I've learned painfully over the years. I do allow myself a fair amount of time thinking about the problem, but that hits diminishing returns fairly quickly; at that point, you learn more by starting to actually code.

Try to favor working code over coding abstractions that you don't need yet, pretty much always.

Mixed feelings here, since I violate this one all *over* Querki's codebase. But I have the odd position of being both the Architect and Product Manager, so I know which abstractions I'm *likely* to care about down the line, and which are worth spending the time on now. That's an unusual situation.

*nod* This aspect of "You won't need it" is tricky, partially because it depends heavily on what your "project" actually is. If what you're fundamentally designing is a language, then so you have it. But if you get tangled in the weeds, you'll never be able to spend enough time on gardening.

Yep -- it's always about laying just enough foundation to build on later, but asking "do I need to do this *now*?" of pretty much everything.

The classic example for me is Querki's identity-management system, which is *wildly* more sophisticated than I need yet; indeed, more sophisticated than nearly any other system I know. But I know where I'm planning on going later this year, and what my long-term objectives are in terms of privacy, and fixing the core abstractions later would have been *extremely* difficult and painful, so it was worth spending the infrastructure time upfront on getting the bones right...

APIs are UIs. A programmer designing an API should be familiar with, say, The Design Of Everyday Things (maybe there are better books for this by now?), and design the API to be a comprehensible tool.

Absolutely true -- indeed, one thing I've learned over the years is that comprehensible UIs and APIs tend to go hand-in-hand. If the API abstractions aren't clean and comprehensible, odds are good that the UI won't be either.

Another way to word it: User-friendly interfaces are important. The vast majority of code isn't intended to be used by a human, but by other code, so it needs to be friendly to that user.

Edited at 2016-05-12 11:04 am (UTC)

Test-driven development is awfully useful, even at the absolute beginning level. A large fraction of absolute-beginner errors amount to "failure to understand the requirements," and writing tests is a good way to uncover such misunderstandings before you've wasted a bunch of time writing code. It also helps to uncover unfriendly APIs -- if I keep stumbling over the order of arguments when I write tests for this function, other callers probably will too.

Tests need to be FAIR: fast, automatic, independent, and reliable. Fast and automatic, because if tests are too time-consuming or too much of a pain to run, you won’t actually run them, and they’ll do you no good. Independent, because if side effects of one test can change the outcome of other tests, it enlarges the debugging space exponentially (and yes, I mean that literally). Reliable, because if a test sometimes fails for reasons having nothing to do with the program being tested, people lose faith in it and it does you no good.

It's a whole lot easier (both simplicity of code and independence of tests) to write tests for functions whose inputs and outputs are explicit -- ideally, parameters and return values -- than for those that depend on global state, hidden state, or I/O, and/or produce results in global state, hidden state, or I/O. Which leads to the corollary "segregate interesting processing from I/O."
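A small sketch of what that segregation can look like, in Python, with hypothetical names. The interesting part (counting words) takes a string and returns a dict, so a test is one line. The I/O part is a thin shell that reads, delegates, and writes, and it takes its streams as parameters so that it can be exercised without touching real files.

```python
def count_words(text):
    """Pure function: explicit input, explicit output, no I/O."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def report_word_counts(in_stream, out_stream):
    """Thin I/O shell: read, hand off to the pure function, write."""
    counts = count_words(in_stream.read())
    for word in sorted(counts):
        out_stream.write(f"{word}: {counts[word]}\n")
```

In normal use you'd pass something like `sys.stdin` and `sys.stdout`; in a test, `io.StringIO` objects stand in for both.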

Edited at 2016-05-12 11:03 am (UTC)

It's a whole lot easier (both simplicity of code and independence of tests) to write tests for functions whose inputs and outputs are explicit -- ideally, parameters and return values -- than for those that depend on global state, hidden state, or I/O, and/or produce results in global state, hidden state, or I/O. Which leads to the corollary "segregate interesting processing from I/O."

When possible, absolutely so -- indeed, the same reasoning leads you down the road towards functional programming, for generally good reasons.

But I'm particularly conscious of this point this week, having *finally* figured out how to enhance my functional-test harness to deal with Email Interception. One of the problems keeping my functional tests (which try to play scrupulously "fair", mocking as little as possible) too primitive was that the key invitation workflow involves email in the middle of it. Figuring out how to stub that so that the tests could "receive" the email (and parse out the critical link from it) was a real headache. Testing is hard work; functional testing is a *lot* of hard work.

(And indeed, the solution turned out to be exactly a version of "segregate interesting processing from I/O" -- refactoring the email-*sending* code from the email-*generating* code so that the sender could be stubbed in the functional-test environment...)

I'm embarrassed to admit that I had never heard of "mocking" until maybe three years ago, and didn't "get" it until well after I started at Google. I think of it as "how do you write tests for your code's interaction with slow or unreliable external systems?" You could try testing your program's resilience against network failures by pulling the plug on the network router while the program is running, but that's just about the opposite of FAIR. So instead you build something that LOOKS like a network router, as far as your program is concerned, but it's actually software and can be pre-programmed to fail in specified ways at specified times.

Which means, in turn, that your software has to be parameterized by the LooksLikeANetworkRouter interface: in normal operation you give it a real network router, while for certain kinds of testing you give it a MockNetworkRouter object that can be told how and when to fail.
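Roughly what that might look like, as a Python sketch. The interface and mock names come from the description above; the `send` method, the retry function, and the "fail on these call numbers" knob are assumptions made up for the example, not any real library's API.

```python
from typing import Protocol


class LooksLikeANetworkRouter(Protocol):
    """The only thing the code under test is allowed to know about."""

    def send(self, payload: str) -> None: ...


class MockNetworkRouter:
    """Test double that fails on pre-programmed call numbers (1-based)."""

    def __init__(self, fail_on_calls=()):
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0
        self.delivered = []

    def send(self, payload: str) -> None:
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError(f"simulated failure on call {self.calls}")
        self.delivered.append(payload)


def send_with_retry(router: LooksLikeANetworkRouter, payload: str,
                    attempts: int = 3) -> bool:
    """Code under test: try up to `attempts` times, report success."""
    for _ in range(attempts):
        try:
            router.send(payload)
            return True
        except ConnectionError:
            continue
    return False
```

A test can then say "fail the first two calls" with `MockNetworkRouter(fail_on_calls={1, 2})` and check that the third attempt gets through, with no cables harmed.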

For another example, my project has lots of code that uses real-world timestamps. It's notoriously difficult to write unit tests involving real-world timestamps, because the real world insists on moving forward in time from one test run to the next. If you rely on a particular section of code taking at least or at most a specified period of time, you've lost the "R for reliable"; if you rely on Sleep(num_microseconds) calls, you've lost the "F for fast". The solution is to parameterize the program with an AbstractClock, one implementation of which is the real system clock, and another implementation is a MockClock that can be read, set, advanced by various amounts, etc.
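A minimal Python sketch of the clock idea. AbstractClock and MockClock are the names used above; the method names and the `Session` class are invented here purely to have something time-dependent to test.

```python
import time
from abc import ABC, abstractmethod


class AbstractClock(ABC):
    @abstractmethod
    def now(self) -> float:
        """Current time, in seconds."""


class SystemClock(AbstractClock):
    def now(self) -> float:
        return time.time()


class MockClock(AbstractClock):
    """A clock that only moves when the test tells it to."""

    def __init__(self, start: float = 0.0):
        self._now = start

    def now(self) -> float:
        return self._now

    def set(self, value: float) -> None:
        self._now = value

    def advance(self, seconds: float) -> None:
        self._now += seconds


class Session:
    """Hypothetical code under test: expires `ttl` seconds after creation."""

    def __init__(self, clock: AbstractClock, ttl: float):
        self._clock = clock
        self._expires_at = clock.now() + ttl

    def is_expired(self) -> bool:
        return self._clock.now() >= self._expires_at
```

A test for a 60-second expiry can call `clock.advance(60)` and run in microseconds, with no sleeping and no dependence on how loaded the test machine happens to be.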

Which brings us to another rule of thumb that I thought somebody had already mentioned in this thread but I don't see now: any object you get from outside your code (whether passed in as a parameter, returned by a factory method, etc.) should be used according to its interface, not its implementation. For an extreme example, the "factorial" function really should have a parameter of type LooksLikeANaturalNumber, as long as that interface has IsZero, Predecessor, and Multiply functions.
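For the curious, here's a sketch of that extreme example in Python. One caveat: with only IsZero, Predecessor, and Multiply there's no way to produce the value 1 for the base case, so this sketch assumes the interface also offers a `one()` method. That addition, and the `IntNatural` implementation, are my assumptions for the sake of a runnable example.

```python
from typing import Protocol


class LooksLikeANaturalNumber(Protocol):
    def is_zero(self) -> bool: ...
    def predecessor(self) -> "LooksLikeANaturalNumber": ...
    def multiply(self, other: "LooksLikeANaturalNumber") -> "LooksLikeANaturalNumber": ...
    def one(self) -> "LooksLikeANaturalNumber": ...  # assumed extra, for the base case


def factorial(n: LooksLikeANaturalNumber) -> LooksLikeANaturalNumber:
    """Uses only the interface; never looks at how n is represented."""
    if n.is_zero():
        return n.one()
    return n.multiply(factorial(n.predecessor()))


class IntNatural:
    """One possible implementation, backed by a plain int."""

    def __init__(self, value: int):
        if value < 0:
            raise ValueError("natural numbers are non-negative")
        self.value = value

    def is_zero(self) -> bool:
        return self.value == 0

    def predecessor(self) -> "IntNatural":
        return IntNatural(self.value - 1)

    def multiply(self, other: "IntNatural") -> "IntNatural":
        return IntNatural(self.value * other.value)

    def one(self) -> "IntNatural":
        return IntNatural(1)
```

Nobody would write factorial this way in production, but it shows the point: `factorial` would work unchanged with any other implementation that satisfies the interface.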

And which reminds me of another rule of testing: for every test, ask yourself where a bug would have to be in order for this test to expose it. If you already have tests that would expose a bug in this specific place, you probably don't need another; conversely, if you have large sections of code in which a bug could hide without any of your tests exposing it, you need more tests. Mocks are good for detecting bugs in the parts of your code that directly interface with the external system that you're mocking.

Edited at 2016-05-13 03:29 am (UTC)

I'm embarrassed to admit that I had never heard of "mocking" until maybe three years ago

I have a standard architectural pattern that I pretty much always use for major programs -- I originally learned it from Tom Leonard at Looking Glass, and then coined the name The Ecology Pattern after I started using it elsewhere. I've written that particular framework in four different languages, as the basis of half a dozen companies, over the past 15 years.

It has several advantages (it was originally developed by Tom mainly to keep C++ compile times decent, and I like it because it's much more rigorous about initialization than many approaches), but one of them is making mocking extremely easy -- it's a variation of the Dependency Injection approach, and insists on a strict separation of interface from implementation. So testing *always* involves building your Ecology out of the real implementations of the components under test, stubbing the interfaces you can ignore, and mocking the ones you want to instrument. Very useful general approach to the world.

And which reminds me of another rule of testing: for every test, ask yourself where a bug would have to be in order for this test to expose it.

Which is related to a general point worth making: debugging is the one part of programming that is actually *science*, and it's worth being rigorously scientific about it. Observe the bug in action; formulate hypotheses; build tests that could prove (and more importantly, disprove) those hypotheses; and see what happens.

I think any experienced programmer knows this deep in their gut, but it's not at all obvious until you've been doing it for a while...

Resiliency trumps simplistic notions of function

Back in the day, we cared about minimizing use of resources and maximizing functionality. But in today's world, RESILIENCY trumps all -- if you don't got it, no one will use your supposedly great features.

Example: Not too long ago my team was devising a new server-side message logging function. I gave the developers direction to save off messages to intermediate SAN storage -- a pretty reliable thing -- and then have a separate background task commit the messages to the destination DB. The reply: Oh no, we can't have people wait to have their messages appear in the log. This was foolish. If the DB was down for maintenance, the entire system would be unusable!

That thinking comes from the old, "enterprise software" way of thinking, where typically everything goes down for maintenance all at one time. A modern cloud system however, needs to maintain as much uptime for as many features as possible, and further needs to assume any dependent system component might be unavailable at any time. By using an intermediate storage approach, the messaging app could still function, even if the persistent store was offline. The actual delay anyone would realize in seeing logged messages would be minuscule anyway, like 5 seconds at worst.
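The shape of that design, as a toy Python sketch. Everything here is a stand-in: an in-memory queue plays the intermediate storage, `FlakyDatabase` plays the destination DB, and `flush` is what a background task would call on a timer. None of it reflects the actual system described.

```python
from collections import deque


class FlakyDatabase:
    """Hypothetical stand-in for the destination DB; can be switched off."""

    def __init__(self):
        self.available = True
        self.rows = []

    def insert(self, message):
        if not self.available:
            raise ConnectionError("database unavailable")
        self.rows.append(message)


class MessageLogger:
    def __init__(self, database):
        self._database = database
        self._spool = deque()  # stand-in for the intermediate storage

    def log(self, message):
        """Succeeds from the caller's point of view even if the DB is down."""
        self._spool.append(message)

    def flush(self):
        """Background step: commit spooled messages in order; return the count."""
        committed = 0
        while self._spool:
            try:
                self._database.insert(self._spool[0])
            except ConnectionError:
                break  # leave the rest spooled and try again later
            self._spool.popleft()
            committed += 1
        return committed
```

While the database is down, `log` keeps accepting messages and `flush` commits nothing; once it comes back, the next `flush` drains the backlog in the original order.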

At least two people will work on your code: you, and you in 6 months' time. And you in 6 months' time won't remember anything about how it works. So write documentation for everything, and be generous with inline comments.

Yeah, that's one that I think most of us have learned the hard way. Nice way of stating it...

Not by me originally, but I don't recall where I first read it.

Avoid complex expressions, by breaking them down and assigning variables.

So instead of:

if ($foo == 'bar' || (count($biz) > 2 && $biz[2] == 'bax'))

break down to:

$second_biz_is_bax = count($biz) > 2 && $biz[2] == 'bax';
if ($foo == 'bar' || $second_biz_is_bax) {}

The main reason is it's easier to read. I like to state this as 'I'm not a computer -- the computer is a computer' -- having to parse long expressions like that to understand their purpose is a waste of human brain power.

Also, it's much less effort to debug when you're trying to work out why the condition is failing, as you can dump the output of $second_biz_is_bax without copy-pasting actual code. (It may be that people with fancy debuggers can get that anyway, but I don't have a fancy debugger...)
