Big Data, Scala and Spark
[For the programmers, particularly for the architects.]

One of the trends that has happened so fast that I suspect most folks haven't even noticed it yet is the sea change that is occurring in Big Data processing right now. The short version is that, relatively recently, some folks from the Scala world pointed out that while Hadoop is a lot better than traditional RDBMS methods for dealing with data at scale, it still kind of sucks for many use cases. So a project got started to rethink the approach to Big Data around a streaming model. That became Apache Spark, and it is taking over the world with remarkable speed.
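To give a feel for why Spark's model is so much pleasanter than classic Hadoop MapReduce, here's a minimal sketch of the canonical word count in Spark's Scala API. This is an illustrative example, not anything from Querki: the input path is hypothetical, and it assumes a local-mode SparkContext rather than a real cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark in-process on all cores -- handy for experiments.
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations (flatMap, map, reduceByKey) are lazy: they just build
    // up a description of the computation. Nothing runs until an action
    // like collect() forces evaluation.
    val counts = sc.textFile("input.txt")     // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word\t$n") }
    sc.stop()
  }
}
```

Compare that to the page or two of boilerplate the same job takes as a hand-rolled Hadoop Mapper/Reducer pair, and the appeal is obvious.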

TypeSafe has posted a blog entry summarizing the benefits of Spark: it's fairly brief, and worth reading if you have any scaled-data requirements, to understand the strengths of the system. It includes a very concise tl;dr summary at the end. (Note, though, that it is written by Dean Wampler, who isn't exactly objective: his talk at NE Scala a few weeks ago more or less bragged that his self-described trolling of the Hadoop community got the ball rolling in the first place.)

Querki isn't using this stuff *yet*, and probably won't for a year or so -- I have to focus on more critical-path issues for now. But I suspect I'll be adopting Spark before long, for things like automatic abuse catching. (I already know some of the obvious ways that wikispammers are going to try to game Querki, and a combination of event-stream and graph analysis is probably going to be helpful in taming that.) And one of Querki's most game-changing features, App Communities, is going to be all about what happens when you combine Querki with Big Data. I suspect that almost any large-scale JVM-based system is likely to find this stuff useful in some fashion...

(Deleted comment)
There are probably a lot more posts on streaming coming, but I'm just getting my head around all that now. Basically, there is a growing consensus that "streaming" is an architectural paradigm whose time has come. There are a lot of different projects doing stuff along these lines, of which Spark is just one.

Indeed, the *really* big deal right now is the emerging Reactive Streams standard, which a bunch of companies including TypeSafe have been working on for the past year or so, and which is just about ready for prime time -- this standard defines a common API for passing "streams" around the JVM in a relatively language- and architecture-neutral way. It is deeply focused on the fairly universal problem of "back-pressure": how do you cope when data is flowing in faster than the downstream components can handle it? Most people simply hack around this, with solutions that work until they crash horribly; Reactive Streams is a principled approach to thinking about pipelines *as* pipelines, not just as individually hacked-together links.
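It really is a small standard. The actual artifact is a handful of Java interfaces in the org.reactivestreams package; here's a Scala paraphrase of them (names per the spec, but this rendering is mine) to show where back-pressure lives -- entirely in Subscription.request:

```scala
// A Subscriber never receives data unasked: it signals demand through the
// Subscription it is handed, and a conforming Publisher must never emit
// more elements than have been requested. That contract *is* back-pressure.

trait Publisher[T] {
  def subscribe(s: Subscriber[_ >: T]): Unit
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit // receives the demand channel
  def onNext(element: T): Unit           // called at most as often as requested
  def onError(t: Throwable): Unit
  def onComplete(): Unit
}

trait Subscription {
  def request(n: Long): Unit // signal readiness for up to n more elements
  def cancel(): Unit
}

// A Processor is both ends at once -- a transformation stage in a pipeline.
trait Processor[In, Out] extends Subscriber[In] with Publisher[Out]
```

Four tiny interfaces; all the subtlety is in the rules about how they interact.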

(Very simple and concise standard once they worked it out, BTW, but it apparently took a lot of work to get the nuances right.)

This paradigm is taking over with *breathtaking* speed. For example, the next big thing coming out from TypeSafe is akka-http, a new HTTP server layer built entirely around Reactive Streams. Its DSL lets you define much of the handling of incoming HTTP data by simply constructing a high-level description of the data flow, and the system assembles the necessary streams to realize that flow automatically. *Really* neat stuff; Querki will probably get rewritten around it sometime in 2016...
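For the flavor of that DSL, here's a hedged sketch of a trivial akka-http server, roughly as it looks in the current previews (details may shift before release; the route and names here are illustrative). The `route` value is just a high-level description of the request flow; akka-http turns it into the Reactive Streams plumbing that actually services connections:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object HelloServer extends App {
  implicit val system = ActorSystem("hello")
  implicit val materializer = ActorMaterializer() // runs the underlying streams

  // A declarative description of the flow: match path "/hello", accept GET,
  // respond with a string. Directives compose into arbitrarily rich routes.
  val route =
    path("hello") {
      get {
        complete("Hello from a stream-backed HTTP server")
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```

Note that there's no explicit stream-wiring anywhere in that code -- the whole point is that the declarative route is compiled down to streams for you.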

(Deleted comment)
Well, feel free to push my button about that any time: Querki is deeply Akka-based, so I'm happy to burble. (Viewed through a certain lens, Querki is mostly a database server that uses Akka for state caching and data processing...)
