Log in

No account? Create an account
Previous Entry Share Next Entry
Programming as art: Scala's new Collections API
Okay, this one's for the hardcore code geeks in the audience; anybody else will find it eye-glazing. But if you really love code, check out the Scala 2.8 Collections API. It's a thing of beauty, and a really good illustration of why I so love Scala -- they systematically deconstructed every major collection type, figured out exactly what they have in common and what makes each one different, and rebuilt the libraries to be about as consistent and sensible as is humanly possible. (They were already way better than average in 2.7, but they nonetheless rewrote it under the hood to make it all *right*.)

The result looks about as close to perfect as you can get: as many common base traits as possible (making everything more learnable and robust); focused on immutable collections (for safety and scalability) while allowing you to work with mutable ones whenever you feel the need; lots of consistent operations, so very-high-level functional programming Just Works; abstractions that let you seamlessly work with infinite collections exactly the same way you do finite ones; thoroughly type-safe from top to bottom, using Scala's type inference to catch programming errors without you needing to state types explicitly. I could go on, but you get the idea.

I sometimes talk about programming as my artform, and I mean that quite seriously: I perceive the aesthetics of code just as clearly as I do for painting or music. The Scala project has won me as an adherent on that basis, perhaps more than any other -- there is a common devotion to crystalline elegance in both the language and its tools. That lets me program *better* on a practical level, but beyond that, it's just plain more *satisfying* a language to work in...

  • 1
Huh. A lot of the things they've done there correspond with thoughts I'd had for That Language That Should Exist. I don't have quite as much of a focus on immutability, though there are certainly uses for which it allows greater optimizations, but their mutable support looks very good as well.

It seems the only data structure features missing here from my idealized language are multi-dimensional array operations (sorta APL) and built-in tree/graph operations (sorta XSLT).

I haven't actually used Scala for anything yet, but it is looking better all the time...

The heavy focus on immutability makes sense when you consider that Scala is very deliberately a bridge language, straddling procedural and functional styles. Immutability doesn't matter so much for traditional procedural designs, but it's absolutely core if you're trying to do serious functional programming. Also, immutable structures are typically much more straightforward to parallelize. So if they can actually get the immutable structures within the same order of magnitude of performance as the mutable ones, it starts to become a performance win. Put those together, and I'm starting to buy into the "immutable is usually better" mindset.

Anyway, I do recommend the language. My current plan is to start messing around with Lift and Akka (the web and distributed-actor frameworks, respectively), with an eye towards possibly getting CommYou rebooted as an open-source project...

Hmm. I think that the distinction between mutating and non-mutating operations may be more crucial than the immutability of the structures themselves. Clearly it's much easier to parallelize non-mutating operations, whether they're on mutable or immutable collections.

Now, it's much easier to assure that operations are non-mutating if they are working on a collection which is itself immutable. It may not be generally possible for the compiler to otherwise assure that, in fact. It may be necessary for the programmer to assure the safety of the operations they perform, which is less than ideal.

But, I think that various mutating operations are essential to any non-trivial application, and faking those mutations with a copy-on-write can get very expensive. So it's difficult for me to buy into the "immutable is usually better" philosophy. Within restricted spheres of operation, immutable is clearly safer, and the internal structure of the collection can even be optimized a fair amount if it will never change, but I'm inclined to think of the immutable form as a transient state that you put your collection through in order to do some things to it. Perhaps functional programmers have the inverse conception...

faking those mutations with a copy-on-write can get very expensive

It can, but folks have made a lot of progress on reducing that expense. One of the most interesting components of the new Collections document is that they have actually gone through and characterized the efficiency of each structure by conceptual operation, so you can do a decently apples-to-apples comparison.

The main upshot is that the immutable structures are mostly at least big-O comparable to their mutable counterparts at this point. In particular, the major Map and Set operations are effectively constant-time for both the mutable and immutable versions. I suspect that the constants aren't the same, but that's usually an good tradeoff if you buys you better scalability.

I can't say I know the full details, but I gather that the data-structures world has moved on to the point where creating a "new" immutable copy, for most of these structures, is more a matter of creating just another shell around the same largely-shared data pool. You're not creating a full new collection, just essentially a diff from the old one, so it's actually pretty fast. It's demanding on the garbage collector, but that's pretty mature technology by now.

(At this point, people are even writing databases that way. I gather that, eg, CouchDB and its ilk are essentially "write-only databases" built on the same principles, so that any pointer into the DB always sees a consistent view, if possibly an obsolete one. My impression is that the result is screamingly fast and highly parallelizable, at the cost of needing periodic compression.)

So yeah, they really do mean immutability as the *primary* form, with mutability used only in constrained circumstances, where you have other reasons to believe that you're not going to hit parallelism issues. A typical Scala example would be in an Actors environment. Each Actor internally is mutable, with vars for its members (since by definition it only has one thread at a time), but all messages it passes should be immutable. So if I'm sending a data structure from Actor to Actor, I should always use an immutable structure, so that each Actor can confidently operate on it without worrying about threading issues...

Interesting. I seem to be behind the curve on a lot of those slicing-up-the-data techniques. I will do some catching up.

The Actor case you mention is a great example for the parallax between philosophies: from my vantage the primary existence of that data is in the Actor, where it is mutable, but you want pass-by-value for handing it off to another Actor. (I can think of examples where I wouldn't want that for a message as well, mostly in a chain of processes that operate on separate parts of some large state-of-the-world object, and want to pass the whole thing to each subsequent stage of the chain.)

I'm sure we could discuss it ad infinitum, but I think it mostly comes down to a question of taste.

Somewhat, yes, and it's interesting to note that my own tastes are changing.

But the Actor thing really is a good example, because you can't think of pass-by-value as being just about how you communicate the data structure -- it's really about how you are *sharing* the data structure. That is, you're passing an immutable pointer, but the important part is that both of the Actors working on that structure should be able to operate on it without worrying about each other. That's difficult to do with mutable structures unless you get into the usual mess of complex locking schemes, or copying the object entire, and making an immutable copy of a mutable structure *is* expensive. But if you think of it as immutable from top to bottom, each Actor can reason about and operate on the structure completely independently, without worrying about what the other guy is doing.

My gradual shift in taste is being driven by these sorts of considerations. Almost every serious project I do is heavily multi-threaded, and I usually spend a ton of time worrying about synchronization issues. (And cleaning up after other programmers who don't quite know what they're doing.)

So I'm increasingly serious about finding architectures that are inherently robust in the face of parallelism, requiring simpler rules to not screw up, since I think that's the way things have to go. Actors seem to be one of the two really good answers I've seen to that. (MapReduce is the other; they seem to be good for different sorts of problems.) And if you follow a discipline of using immutable collections, Actors become significantly safer, since there is less danger of accidentally exposing mutable objects...

  • 1