Justin du Coeur (jducoeur) wrote,

Big Data, Scala and Spark

[For the programmers, particularly for the architects.]

One of the trends that has happened so fast that I suspect most folks haven't even noticed it yet is the sea change that is occurring in Big Data processing right now. The short version is that, relatively recently, some folks from the Scala world pointed out that while Hadoop is a lot better than traditional RDBMS methods for dealing with data at scale, it still kind of sucks for many use cases. So a project got started to rethink the approach to Big Data around a streaming model. That became Apache Spark, and it is taking over the world with remarkable speed.
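To give a flavor of why the Spark model appeals to Scala folks: Spark jobs are written as chained functional transformations, essentially the same style as Scala's collection API. Here's a minimal word-count sketch using plain Scala collections (so it runs without a cluster); the real Spark version looks almost identical, except the pipeline starts from `sc.textFile(...)` and uses `reduceByKey` in place of the `groupBy`/sum step, with Spark distributing each stage across the cluster. The object and method names here are illustrative, not from any real codebase.

```scala
// Word count in the Spark style, sketched with plain Scala collections.
// In actual Spark, `lines` would be an RDD from sc.textFile(...), and the
// groupBy/sum step would be .reduceByKey(_ + _) -- same shape, distributed.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))       // split each line into words
      .filter(_.nonEmpty)             // drop empty tokens
      .map(w => (w.toLowerCase, 1))   // pair each word with a count of 1
      .groupBy(_._1)                  // Spark equivalent: .reduceByKey(_ + _)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(wordCount(Seq("to be or not to be")))
  }
}
```

The point isn't this toy example, of course; it's that the same few lines of pipeline code scale from a `Seq` on your laptop to terabytes on a cluster, without the boilerplate of hand-written Hadoop MapReduce jobs.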

Typesafe has posted a blog entry summarizing the benefits of Spark: it's fairly brief, and worth reading if you have any scaled-data requirements, to understand the system's strengths. It includes a very concise tl;dr at the end. (Note, though, that it's written by Dean Wampler, who isn't exactly objective: in his talk at NE Scala a few weeks ago, he kind of bragged that his self-described trolling of the Hadoop community got the ball rolling in the first place.)

Querki isn't using this stuff *yet*, and probably won't for at least a year -- I have to focus on more critical-path issues for now. But I suspect I'll be adopting Spark before long, for things like automatic abuse detection. (I already know some of the obvious ways that wikispammers are going to try to game Querki, and a combination of event-stream and graph analysis will probably help tame that.) And one of Querki's most game-changing features, App Communities, is going to be all about what happens when you combine Querki with Big Data. I suspect that almost any large-scale JVM-based system is likely to find this stuff useful in some fashion...
Tags: programming, scala
