Ephemeral, Fast Data Finds Home in Memory
For any number of businesses, big data isn't useful unless it's also fast, responsive data. In other words, the challenges of volume and complexity represent just one face of the big data coin.
Flip that coin over and one finds new but interrelated problems to explore around big data that is ephemeral, or fast-changing in nature—not to mention systems that require new ways of scaling.
For clearer context, think about the modern reservation systems that power large international airlines. These systems absorb a constant stream of daily changes as conditions and passenger routes or plans shift. Or consider a hedge fund that wants to perform real-time risk management to gauge a position in the midst of the trading day. Even smart grid operators, with their thousands of solar collectors feeding real-time data reports, have to make snap decisions about energy distribution that rely on big, fast data.
Increasingly, these uses of fast big data are finding their way off the traditional disk and into memory to take advantage of cheaper, RAM-packed commodity hardware and the ability to perform rapid analysis of large datasets close to where the processing power lies.
The potential for higher throughput on some applications and far faster response times is key for financial services in particular, but according to many of the in-memory vendors we've talked to over the last year, new markets looking to move off the disk are cropping up as well. Accordingly, new solutions and companies with in-memory promises appear daily, some of them leveraging newer big data frameworks like Hadoop and MapReduce.
While ScaleOut Software has been around since 2003, it spent the better part of its early days tucked away in the HPC and related financial services space, although it also found some customers on the web-scale ecommerce side. ScaleOut represents another example of a niche company building out on the promise of in-memory, especially after this morning's announcement of the ScaleOut Analytics Server, which combines a scalable in-memory grid with a built-in engine for time-critical analytics. The company is pushing the ease-of-use angle—something it says isn't baked into Hadoop approaches, which carry complexity for application developers and can hinder performance.
We talked about the broader in-memory data grid approach to big, changing data with CEO and co-founder Dr. William Bain. A noted parallel computing and scalability expert, Bain says that big, fast data requires a cleaner, easier approach to scalable in-memory storage, one that delivers scalability, speedy response times, and high availability in one swoop.
Bain says that addressing the kind of ephemeral, fast-changing data found in reservation systems, hedge funds, and smart grids requires something that can adapt and scale while maintaining high availability as a built-in component.
Analytics Server, which the company rolled out today, takes aim at these challenges by moving executable code and libraries from the developer’s workstation to grid servers for parallel data analysis. Developers can then define what ScaleOut calls Java or C# invocations to pre-stage the in-memory data grid’s execution environment to take a few manual steps out of the equation.
At the core of this capability are "invocation grids," which let an application pre-stage the code and supporting libraries—something Bain notes Hadoop can't do. "If you need particular libraries to be loaded, users can specify that and move them from the workstation to the grid, spin up the Java VMs or .NET runtimes, and pre-load those libraries, so when the invocation loads we can reload the code if it has changed or rerun it if it hasn't."
Being able to shift code and pre-stage an environment for analysis automatically presents an attractive alternative to Hadoop for developers, says Bain. For instance, when users on a workstation are pulling in data from the grid but keep their analysis code on that workstation, there is a fundamental gap between where the data lives and where the code runs. To remedy this, the offering lets the application invoke a MapReduce-like operation that automatically ships the code to the grid for execution, in effect a Hadoop-like batch submission process without the batch scheduler.
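To make the pre-staging idea more concrete, here is a minimal sketch of what declaring such an environment might look like. To be clear, every name below is a hypothetical stand-in invented for illustration, not ScaleOut's actual API; the point is only that the developer declares the analysis class and its dependencies once, and the grid takes over shipping, loading, and reloading them.

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical sketch of an "invocation grid" descriptor. None of these class
// names come from ScaleOut's API; they only illustrate the idea of pre-staging
// code and libraries on the grid before any analysis runs.
public class PreStagedInvocation {

    // Describes everything the grid needs before the first invocation:
    // the analysis class to load and the JARs it depends on.
    static class InvocationGridSpec implements Serializable {
        final String name;                 // logical name for this staged environment
        final String entryClass;           // fully qualified analysis class
        final List<String> dependencyJars; // libraries shipped from the workstation

        InvocationGridSpec(String name, String entryClass, List<String> jars) {
            this.name = name;
            this.entryClass = entryClass;
            this.dependencyJars = jars;
        }
    }

    public static void main(String[] args) {
        // The developer declares the environment once; conceptually, the grid
        // then ships the JARs, spins up JVMs on each node, and pre-loads the
        // classes. Later invocations reuse the staged environment, reloading
        // code only if it has changed -- the behavior Bain describes above.
        InvocationGridSpec spec = new InvocationGridSpec(
                "risk-analysis",
                "com.example.RiskEvaluator",
                List.of("risk-model.jar", "market-data-client.jar"));

        System.out.printf("Staging '%s' with %d dependencies...%n",
                spec.name, spec.dependencyJars.size());
    }
}
```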
And at what cost, performance-wise, you might ask? Bain concedes that "overhead has to be endured one way or another," but argues it is minimal compared with Hadoop's because of the automated shipping of libraries to the grid. What you get in the end with this approach, he says, is low latency and automation that leads to far greater ease of use.
The ScaleOut CEO says that when it comes to Hadoop, there is overhead caused by significant, unnecessary complexity. In-memory data grids as a whole, he claims, try as much as possible to leverage the object-oriented programming models of C# and Java because they're storing data that's already chunked up into objects.
These objects usually already have semantics associated with them (stock history objects, reservation objects, shopping carts, and so on); they already exist in memory and have tight associations with the language. In the end, that makes them queryable using the properties the language defines on those objects. "That's a much simpler model than the one Hadoop offers, in which you store data you maybe retrieved from a relational database server or created and stuck in a big file—you then require the application programmer to chunk that up in a logical way and serve it to the workers to analyze by creating keys and associated values."
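A minimal sketch may help illustrate the contrast Bain is drawing. The StockHistory class and the in-memory collection below are illustrative stand-ins, not ScaleOut classes: the point is simply that when objects carry their own semantics, selecting them for analysis is just a predicate over language-defined properties, with no key/value packing or manual chunking.

```java
import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of property-based selection over in-memory objects.
// All names here are hypothetical, invented for illustration only.
public class PropertyQuerySketch {

    static class StockHistory implements Serializable {
        final String ticker;
        final double volatility;

        StockHistory(String ticker, double volatility) {
            this.ticker = ticker;
            this.volatility = volatility;
        }

        String getTicker()     { return ticker; }
        double getVolatility() { return volatility; }
    }

    public static void main(String[] args) {
        // Stand-in for objects already living in the grid's memory.
        List<StockHistory> grid = List.of(
                new StockHistory("AAA", 0.42),
                new StockHistory("BBB", 0.07),
                new StockHistory("CCC", 0.65));

        // Select objects by a class property -- no keys and values to
        // construct, in contrast with the Hadoop model described above.
        List<StockHistory> risky = grid.stream()
                .filter(s -> s.getVolatility() > 0.30)
                .collect(Collectors.toList());

        risky.forEach(s -> System.out.println(s.getTicker()));
    }
}
```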
Further, Bain explains that objects to be analyzed are typically organized in the data grid by a language-defined type and selected for analysis by querying their class properties. Called Parallel Method Invocation (PMI), this technique could ease application design by using an integrated in-memory execution model built from familiar language constructs. This means users can tap the speed of the in-memory approach without dabbling in the nasty complexities of parallel scheduling or building analytical algorithms for complex data analysis.
During our discussion this week, the company's COO, David Brinker, talked about the value of object-oriented programming in the offering. Its in-memory data grid stores data as serialized objects, "enabling the use of intuitive, property-based queries to select objects for analysis." The company claims that analysis code can be developed using standard, object-oriented techniques without the need to understand parallel programming or tune complex data mapping and reduction models.
Brinker detailed how ScaleOut has built in a MapReduce computation engine that executes user-defined evaluation (map) methods in parallel on selected objects in the data grid, then combines those results with a user-defined reduce method. "If you think about a traditional HPC grid computing application that requires fast access across a grid of servers to fast-changing data, that's an in-memory data grid need we're already filling. What we're doing now, however, is building in MapReduce functionality so you can analyze the data that is in the grid."
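Here is a hedged sketch of the evaluate-then-combine pattern Brinker describes: a user-defined evaluation (map) method runs in parallel over the selected objects, and a user-defined merge (reduce) method folds the partial results together. The names and the parallel-stream stand-in are illustrative only, not ScaleOut's engine; the structure is the point.

```java
import java.io.Serializable;
import java.util.List;

// Illustrative eval/merge sketch; all names are hypothetical stand-ins.
public class EvalMergeSketch {

    static class Position implements Serializable {
        final double exposure;
        Position(double exposure) { this.exposure = exposure; }
    }

    // Evaluation (map) step: score one object in isolation, so the grid
    // can run this method on each server against locally stored objects.
    static double evaluate(Position p) {
        return p.exposure * 1.1; // placeholder risk model for illustration
    }

    // Merge (reduce) step: combine two partial results; must be associative
    // so partial results from different servers can be folded in any order.
    static double merge(double a, double b) {
        return a + b;
    }

    public static void main(String[] args) {
        // Stand-in for objects selected from the grid by a property query.
        List<Position> selected = List.of(
                new Position(100.0), new Position(250.0), new Position(75.0));

        // parallelStream() stands in for the grid fanning evaluate() out
        // across its servers; reduce() stands in for the merge phase.
        double totalRisk = selected.parallelStream()
                .map(EvalMergeSketch::evaluate)
                .reduce(0.0, EvalMergeSketch::merge);

        System.out.println("Aggregate risk: " + totalRisk);
    }
}
```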
Both Bain and Brinker contend that other in-memory data grid vendors place exaggerated emphasis on the storage and acceleration aspects of in-memory data grids rather than on analysis. The company is thus adding itself to a growing list of in-memory data grid vendors seeking to keep pace with the growth of Hadoop, even as they assert that there are other, potentially more efficient approaches to in-memory data analysis, especially for ephemeral, fast-changing data.