Real-Time Streaming for ML Usage Jumps 5X, Study Says
A new survey sponsored by Lightbend found that implementations of real-time stream data processing systems for machine learning and AI applications jumped roughly fivefold over the past two years. It also found that Kafka is almost ubiquitous, that concerns about state handling must be addressed, and that containers are prime locations for fast data systems.
Lightbend, which develops a real-time streaming data stack based on Akka technology, commissioned The New Stack to query 804 IT professionals about their use of real-time stream data processing. The survey discovered that the most common applications for fast data tech (which includes real-time processing that also supports batch) haven’t changed a whole lot.
Application monitoring and log aggregation workloads still dominate the leaderboards for fast data use cases. “Application monitoring and log aggregation are the top use cases because it is essential to detect problems quickly instead of waiting for offline analysis,” Lightbend states in the survey. “Instead of storing all of the raw data generated by modern applications and systems, the data is often aggregated and stored into time-series databases that only store metrics that can be easily analyzed.”
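As a rough illustration of the aggregation Lightbend describes, the sketch below rolls raw events up into per-minute metrics, which is the kind of compact record a time-series database would store. The event shape (a timestamp in seconds and a latency in milliseconds), the function name, and the metric names are assumptions made for this example and do not come from the survey.

```python
from collections import defaultdict


def aggregate_per_minute(events):
    """Roll raw (timestamp_seconds, latency_ms) events up into per-minute metrics."""
    buckets = defaultdict(list)
    for ts, latency_ms in events:
        minute_start = ts - (ts % 60)  # align the event to the start of its minute
        buckets[minute_start].append(latency_ms)
    # Keep only summary metrics per minute instead of every raw event.
    return {
        minute: {
            "count": len(values),
            "max_ms": max(values),
            "avg_ms": sum(values) / len(values),
        }
        for minute, values in sorted(buckets.items())
    }


# Hypothetical raw events: two in the first minute, one in the second.
raw_events = [(0, 10), (30, 30), (61, 50)]
metrics = aggregate_per_minute(raw_events)
print(metrics)
```

In a real deployment a stream processor would compute these windows continuously as events arrive; the batch-style loop here only shows the shape of the reduction.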
But when the investigators asked about ML and AI use cases, they found a juicy little nugget. It turns out that fast data implementations in support of ML and AI jumped from 6% of the total survey pool in 2017 (the last time Lightbend conducted this survey) to 33% in 2019. That represents roughly a five-fold increase in the use of fast-data systems for ML and AI workloads over the past two years.
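The arithmetic behind that jump can be checked directly from the two reported shares. The snippet below is only a worked calculation; the variable names are chosen for this example.

```python
# Shares of the survey pool using fast data for ML/AI, as reported in the survey.
share_2017 = 6   # percent
share_2019 = 33  # percent

fold_change = share_2019 / share_2017                 # how many times larger
point_change = share_2019 - share_2017                # percentage points gained
relative_increase_pct = point_change / share_2017 * 100  # relative growth in percent

print(fold_change, point_change, relative_increase_pct)
```

The fold change (how many times larger the 2019 share is) and the relative increase (how much was added on top of the 2017 share) are different measures, which is why the growth can be summarized with different headline numbers.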
The most common workload for fast data, incidentally, is extract, transform, and load (ETL), which is used in 36% of fast data implementations, according to the survey. Compared to application monitoring and log aggregation, ETL is a relatively new entrant to the world of fast data, the company says. “ETL and data warehousing are old problems for which streaming is now being applied,” it says.
All told, there has been a widespread increase in the development and rollout of fast data systems across a variety of industries. In addition to ETL workloads, Lightbend sees the fast-data numbers sharply ticking up for related use cases, such as processing data emanating from IoT pipelines as well as the integration of different data streams.
While each use case is different, Lightbend sees a common goal in many of these fast data implementations that involve some sort of data integration and transformation: ML and AI. The quicker organizations can get the treated, integrated, and cleansed data in front of the algorithms, the more value can be extracted from the data.
“We see a renaissance right now where developers are being asked to be a lot more ‘data smart,'” says Mark Brewer, the CEO at Lightbend, in a press release that accompanied the release of today’s report. “Streaming data is table stakes for the most interesting future use cases — artificial intelligence and machine learning most notably — and that’s giving rise to the number of programming languages, frameworks and tools for building and running streaming data-centric applications.”
AI/ML use cases for fast data projects are projected to increase nearly 33% in the next 12 months, the survey found. That trounces the next two most-anticipated fast data use cases, application monitoring and integration of different data streams, which are projected to increase by 17% and 16%, respectively.
That doesn’t mean there aren’t headwinds with fast data, which has seen its ups and downs over the years (although nothing compared to the multi-decadal winters that AI has experienced). Some of the most often cited challenges to implementing fast data are developer skills (cited by 31% of survey takers); complexity of tools, techniques, and technologies (30%); choosing the right tools and techniques (26%); difficulty making changes (26%); and integration with legacy infrastructure (24%).
Another obstacle to building fast data systems cited by the survey takers is state. In stateful stream processing, earlier events in the stream affect how the current event is handled, and that state must be managed carefully, particularly when scaling a system up. Stateless stream processing, where every event is handled independently of all other events, is easier to build and scale.
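The difference between the two styles can be sketched in a few lines. This is a minimal illustration and not how any particular streaming framework implements state; the event fields (`user`, `value`), the threshold, and the per-user limit are all hypothetical.

```python
from collections import defaultdict


def handle_stateless(event):
    """Stateless: the decision depends only on the current event."""
    return event["value"] > 100


class StatefulHandler:
    """Stateful: the decision depends on events seen earlier for the same user."""

    def __init__(self, limit=3):
        self.limit = limit
        self.seen = defaultdict(int)  # per-user event counts: this is the state

    def handle(self, event):
        self.seen[event["user"]] += 1
        # Flag a user only once they exceed the allowed number of events.
        return self.seen[event["user"]] > self.limit


# Hypothetical stream: four events from user "a", then one from user "b".
events = [{"user": "a", "value": 150}] * 4 + [{"user": "b", "value": 50}]

stateless_flags = [handle_stateless(e) for e in events]
handler = StatefulHandler(limit=3)
stateful_flags = [handler.handle(e) for e in events]

print(stateless_flags)
print(stateful_flags)
```

The stateless function can run on any number of workers with no coordination. The stateful handler only gives correct answers if every event for a given user reaches the same counter, which means routing events by key or sharing the state store, and that is where the scaling difficulty comes from.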
More than 60% of survey takers said state is “greatly” or “to some extent” impacting the deployment of more applications within a microservices architecture, Lightbend’s survey found. However, the survey also found that, as organizations become more familiar with stream processing and build more streaming applications, they view state as less of a barrier.
It’s quite common for users to integrate a streaming data system with persistent data stores and databases, according to the survey. Elasticsearch, Cassandra, Postgres, MongoDB, and Hadoop were the five most popular databases or data stores to be integrated with stream processing systems, the survey found.
“Users of modern data stores like Cassandra, MongoDB and Redis are less likely to believe state is inhibiting adoption of microservices,” the survey found. “However, some of the most common technologies used with stream processing are deployed by those who believe handling state is greatly inhibiting microservices adoption,” it stated, citing high usage of Kafka, Spark Streaming and Elasticsearch among this cross-section of the sample.
There is also a strong correlation between the adoption of Kubernetes and stream processing. Nearly 70% of IT professionals say it’s somewhat likely or extremely likely that stream processing “will be deployed in the same stack as a container orchestrator like Kubernetes,” the survey found. That’s a lower figure than competing container approaches, such as AWS Lambda, it found.
Apache Kafka is the backbone for the majority of fast data applications, according to the survey, with nearly eight in 10 respondents saying they either have it in production (49%), are evaluating or piloting it (19%), or plan to look into Kafka (11%). Other popular frameworks in use include Apache Spark Streaming (25% in production), Lightbend’s Akka Streams (23% in production), and Apache NiFi (9% in production).
When it comes to programming languages used to build fast data systems, Java is the big winner, with 75% of survey respondents saying they use Java. That should come as no surprise, considering that Kafka and Spark were both developed in Java and Scala, which is a JVM-compatible language. JavaScript was the second-most popular language used in fast-data use cases with a 47% share, followed by Scala at 45% and Python at 32%.
The always-on nature of fast data and real-time streaming systems makes them a natural fit for today’s competitive organizations that want to make the most out of data. As streaming and fast-data technologies improve and mature, they will become an even more important part of organizations’ data infrastructures.
You can download a copy of the report here.