February 26, 2018

Weighing Open Source’s Worth for the Future of Big Data

Alex Woodie

The open source software movement began in earnest 20 years ago, when a group of technology leaders in Silicon Valley coined the term as an alternative to the repugnant “free software.” Fast forward to 2018, and the concept has been cemented in our psyches. But does open source have the staying power to drive the next 20 years’ worth of innovation?

There was, of course, open source software before 1998. Linus Torvalds created the first Linux kernel in the open back in 1991, and even IBM engaged in sharing of operating system internals going back into the 1950s.

But in hindsight, the creation of the phrase itself following that meeting, and the momentum that tech leaders consciously built behind open source after Netscape decided to release the source code for Navigator as “open source,” stand out as defining moments in what would become a history-changing worldwide movement.

As it so happens, the story of big data’s technology evolution is inseparable from open source’s story. As Silicon Valley Web titans solved tough tech challenges in the 2000s, they tracked the progress with open source, either by releasing the idea, as in Google’s seminal MapReduce and BigTable papers, or as an actual product, such as Apache Hadoop.

Nearly all of the groundbreaking products in big data have been open source, and most of them originated at tech giants. Hadoop owes its origins to the Google File System and MapReduce papers, and gestated at Yahoo, while Cassandra and Hive were both created at Facebook. Airflow came from Airbnb, while Storm elevated its game at Twitter.

That’s not to say that everything that matters is an open source version of something developed at a Web giant. Splunk and SAS come to mind as leading tech providers for NoSQL/search and data science tooling, respectively. And Apache Spark came out of UC Berkeley’s AMPLab, making it another exception to the rule. But the vast majority of meaningful big data software products today are open source.

That’s no accident, says Jay Kreps, who created Kafka at LinkedIn and is now CEO at Confluent. “Open source just happens to be particularly good at building these open, reliable, well-governed interfaces,” he says. “If you look at the composition of any modern software architecture, just on a percentage basis, vastly more of that is open source technology. If anything, that’s increasing as time goes by.”

Standing on the Shoulders of Open

According to Kreps, we’re at a unique juncture in the trajectory of open source relative to the tech ecosystem as a whole.

In the early days, organizations used the open source concept mostly to promote software products that were free and open alternatives to established proprietary ones. That gave us MySQL, which is basically a free and open version of a relational database management system you could get from IBM, Oracle, or Microsoft. It gave us Linux, a free and open version of UNIX that competed with AIX, Solaris, and HP-UX.

But nowadays, we’re more likely to see a groundbreaking product debut as open source. The Kafka messaging system is a perfect example of how open source is leapfrogging the proprietary pack.

“If you’re trying to overcome a technology like relational databases, which have been developed over decades and had gestation from every major university in the world that does computer science research, it takes a long time to climb that hill,” Kreps says. “What’s very different for us is there hasn’t really been this incredibly well-developed infrastructure layer in the space we’re entering. We get to kind of make it up as we go along, which is a huge advantage.”

This is perhaps why proprietary RDBMSs continue to drive the lion’s share of enterprise spending in the data management space, despite the availability of the MySQL, MariaDB, and PostgreSQL RDBMSs, the advent of modern NoSQL and NewSQL solutions, and scalable Hadoop and object-storage alternatives.

There’s nothing quite like Kafka, which paved the way for it to dominate its category. “The world of enterprise messaging and ETL products – everything in that asynchronous space is not nearly as good as Kafka is already,” Kreps says. “You’re seeing things like Kafka and other open source projects like Kubernetes – they’re actually very different from what was around before. There’s no real proprietary alternative at all that’s a direct analog.”

Room for Proprietary?

There’s no question that open source dominates discussion at the infrastructure layer. Even IBM is adamantly embracing open source software in its proprietary business systems, including the IBM i midrange server and the System z mainframe, as a way to make the established tech more familiar to new recruits. But does that mean there’s no room for proprietary software?

The truth is that nearly all of the vendors hawking open source wares have enterprise add-ons. Confluent develops proprietary add-ons for Kafka, Cloudera develops proprietary software for Hadoop, and DataStax creates proprietary add-ons for Cassandra. (Hortonworks is one of the only big data tech vendors that eschews all forms of proprietary software; it succeeds or fails entirely on the back of purely open source software.)

You’re also more likely to see proprietary extensions the higher up the stack you go. For example, Umbel uses open source technologies like Cassandra, Elastic, PostgreSQL, and Spark with its sports analytics package. Any professional sports franchise could assemble the same technologies to create its own system, but it would take many years and millions of dollars to build something as good as Umbel’s, according to Kevin Safford, Umbel’s senior director of engineering.

“They’re general purpose tools,” Safford says. “To make them useful for a particular enterprise requires honing that down to your own needs and purpose, so that’s where the in-house and proprietary components come in, to make sure it can meet the standards that we have for performance and reliability, as well as to provide the specific feature set that we offer to our clients.”

DataTorrent also has feet squarely in both camps. On the one hand, it backs Apache Apex, the powerful open source streaming analytics engine. But on the other hand, it offers its own proprietary extensions to Apex in DataTorrent RTS, which includes the newly released Apoxi technology that’s designed to glue together various open source components.

“We’ve gotten to this weird point where customers are saying, ‘I want to use the innovation of open source, but I’ve got a problem,’” says DataTorrent CEO Guy Churchward. “They say, ‘I want to use open source and get an application running in a quarter or two. But as the final kicker, I need it enterprise grade.’”

While individual big data components like Kafka might be enterprise-grade, that enterprise-grade rating goes out the window when multiple open source components are brought together to form a working application, he says. That’s the gist behind Apoxi, which stitches together a handful of pre-selected open source products for customers.

“The open source community is very good at working on their individual components, but they don’t think of an application as an outcome for what the customers really want,” Churchward says. “We’re taking the components in and certifying them. But it’s going to be closed code to make sure we stitch this thing together properly. Otherwise you risk the data integrity side, the enterprise-grade side, and the reliability and lights-out. It’s super critical we do that.”

Future of Open Source

Whether the open source movement can continue to deliver innovation over the next 20 years is an open question. Considering the current momentum it has, it’s probably not smart to bet against it. However, the challenges that customers face in dealing with the increased complexity that goes along with integrating open source projects are not something that can be easily swept aside.

But for enterprises that are intent on using big data tech to gain a competitive advantage, there appears to be no alternative: open source software must be used, no matter the complexity.

“Open source has increasingly been a very important part of technology at Bloomberg,” Gideon Mann, Bloomberg’s head of data science, told Datanami last year. “It’s just not possible to stay competitive without doing open source these days, so there’s a lot of open source that we leverage.”

At the core of open source’s success is a virtuous cycle. As open source software gets better, it attracts more users, which in turn helps the product get even better over time. Hadoop creator Doug Cutting touched on the dynamic two years ago at the Strata + Hadoop World conference, where he said open source’s power comes from a lack of centralized control and an invisible Darwinian hand.

“We have people creating new projects,” he said. “Spark’s a great example. It came out of Berkeley as a sort of random mutation. People tried it. Over time they find it’s a useful mutation, as opposed to the six or seven others that we don’t remember. And it’s the successful one that overtook [the others] and it becomes a success.”

One could try building a proprietary layer of the stack and go around selling it door to door, Kreps says. “But it’s a very long hard road to get there, and I think it’s unlikely that a startup would be able to make that work,” he says. “But the approach we took with open source is it actually allows the idea to spread much faster, and it takes root, it gets more and more popular. I’m obviously very happy that that’s what we did when we open sourced Kafka at LinkedIn.”

That dynamic continues to play out, giving us immensely powerful open source software. “You see exactly what you hope for, which is an awesome competitive ecosystem of these different products and technologies that are fighting for mindshare and usage out in the world,” Kreps says. “And they compete in large part on quality, whether they solve a real problem in the world or whether they’re worth people’s time to learn and put into practice. That’s what keeps the various projects on their toes.”
