LinkedIn Unleashes ‘Nearline’ Data Streaming
LinkedIn is releasing its Brooklin data ingestion service to the open source community.
Brooklin has been running in production on the social media platform since 2016. The stateless and distributed service is used primarily for streaming data in near real time—also known as “nearline”—at scale. LinkedIn estimates the service handles more than 2 trillion messages per day as well as thousands of data streams.
The service was developed in part to meet growing demand for low-latency data pipelines that would scale. “Moving massive amounts of data reliably at high rates was not the only problem we had to tackle,” said LinkedIn’s Celia Kung.
“Supporting a rapidly increasing variety of data storage and messaging systems has proven to be an equally critical aspect of any viable solution,” she added in a blog post announcing the open-source release of Brooklin.
LinkedIn has been using Brooklin to stream data across a variety of data sources, including Espresso and Oracle, along with messaging systems ranging from Amazon Web Services Kinesis and Microsoft Azure Event Hubs to Kafka.
Brooklin’s use cases include serving as a “streaming bridge” and as an upgraded version of Kafka mirroring. In the first use case, LinkedIn touts Brooklin as a means of streaming data across different cloud services like AWS Kinesis or Azure. It can also move data between different clusters within a datacenter or across different datacenters.
That feature allows application developers to focus on data processing rather than data movement. The service can also be configured to stream incoming data in a specified format while encrypting outgoing data, the company said.
In the Kafka mirroring scenario, LinkedIn said it previously used a Kafka feature called MirrorMaker to shift data among different Kafka clusters. Brooklin allowed developers to consolidate Kafka mirroring into a single streaming data service. The Microsoft unit also uses Brooklin to move large volumes of Kafka data between its internal cloud and Azure.
The tool is used to mirror trillions of LinkedIn messages each day, the company noted.
A related multi-tenancy capability in Brooklin addresses a limitation in Kafka MirrorMaker, in which each MirrorMaker cluster can be configured to mirror data only between two Kafka clusters. Brooklin is designed to handle several independent data pipelines concurrently, meaning a single Brooklin cluster can synchronize multiple Kafka clusters.
Further, Brooklin’s mirroring feature can detect errors at a partition level and automatically pause mirroring when errors arise.
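As a rough illustration of those two ideas (several independent pipelines hosted together, and pausing at the partition level when a write fails), the following toy Python sketch may help. It is not Brooklin's API: the `MirrorPipeline` class, the `flaky_send` producer and the cluster names are all hypothetical, chosen only to show the behavior described above.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorPipeline:
    """Toy model of one mirroring pipeline (hypothetical, not Brooklin's API)."""
    source: str
    destination: str
    paused: set = field(default_factory=set)  # partitions currently paused

    def mirror(self, partition, record, send):
        """Forward one record unless its partition is paused; pause on error."""
        if partition in self.paused:
            return False
        try:
            send(self.destination, partition, record)
            return True
        except Exception:
            # Pause only the failing partition; other partitions keep flowing.
            self.paused.add(partition)
            return False


def flaky_send(destination, partition, record):
    """Stand-in producer (an assumption) that always fails for partition 1."""
    if partition == 1:
        raise IOError("simulated write failure")


# One process hosting two independent pipelines, each with its own pause state.
pipelines = [
    MirrorPipeline("kafka-a", "kafka-b"),
    MirrorPipeline("kafka-c", "kafka-d"),
]
results = [pipelines[0].mirror(p, b"msg", flaky_send) for p in (0, 1, 2, 1)]
```

In this sketch an error on partition 1 of the first pipeline pauses only that partition; partitions 0 and 2, and the second pipeline entirely, are unaffected.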
The data streaming system is also promoted as providing better isolation of computing resources among applications and online storage. That feature is said to allow applications to scale independently of databases, thereby reducing the risk of a database failure.
The source code for Brooklin is available now on GitHub.