LinkedIn To Open Source Its Data Lakehouse Management Tool OpenHouse
LinkedIn has announced the open sourcing of OpenHouse, a management framework for data lakehouses. OpenHouse offers a control plane that gives users an interface to managed tables in open-source data lakehouse deployments. Now that the code is available on GitHub, organizations of all sizes can use the framework to manage their own lakehouses.
OpenHouse was first introduced by LinkedIn last year to power machine learning and analytics workloads. By supporting data-driven decisions, OpenHouse helps LinkedIn users gather better job insights and connect with professionals around the globe to expand their networks.
The top features of OpenHouse include Fundamental Catalog Operations, Retention Management, and Pluggability. LinkedIn reports that the impact has been significant: OpenHouse has cut the time-to-market for LinkedIn’s dbt implementation on managed tables by more than six months. In addition, the platform has reduced the end-user toil associated with data sharing by 50 percent.
OpenHouse deployments are built on three building blocks: compute engines, a metadata catalog, and distributed storage. Until OpenHouse was released, these building blocks operated independently as part of an overall data plane, and no single open-source system unified them under one control plane. As a result, users had to juggle multiple systems and manage tables individually, which added complexity and the potential for inconsistencies.
With the introduction of OpenHouse, LinkedIn provided an experience that reduces toil for product engineering by enabling users to take charge of tables. In addition, it offers improved developer experience for data infra customers, and enhanced governance for LinkedIn’s data. LinkedIn has already implemented more than 3,500 managed OpenHouse tables in production, serving more than 550 daily active users with a wide range of use cases.
The ability of OpenHouse to offer fully managed, publicly shareable, and governed tables in open-source lakehouse deployments was based on four guiding principles.
The first principle is that the table is the only API abstraction for end users. No direct access to files or blobs is permitted, as all access should go through a table interface. Secondly, tables are stored in a protected storage namespace that the control plane has full control over. This allows the control plane to be opinionated about different management aspects.
Thirdly, tables are governed based on established company standards, and lastly, tables are regularly maintained for optimized performance.
The user workflow consists of creating tables, setting table metadata, loading data into tables, and sharing tables, all in a single chain of API calls that mostly use standard SQL or DataFrame syntax.
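To make the workflow concrete, here is a minimal in-memory sketch of the four steps (create, set metadata, load, share) behind a table-only interface. It is an illustration under stated assumptions, not OpenHouse's actual API: the class ToyControlPlane, its method names, and the `ds` date column used for retention are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class ManagedTable:
    """A toy managed table: rows plus the metadata a control plane tracks."""
    name: str
    columns: tuple
    owner: str
    retention_days: Optional[int] = None
    readers: set = field(default_factory=set)
    rows: list = field(default_factory=list)


class ToyControlPlane:
    """Hypothetical stand-in for a lakehouse control plane (not OpenHouse's API).

    Callers only ever name tables; the underlying storage (here, a dict)
    is private to the control plane.
    """

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns, owner):
        if name in self._tables:
            raise ValueError(f"table {name!r} already exists")
        table = ManagedTable(name=name, columns=tuple(columns), owner=owner)
        table.readers.add(owner)  # the creator can always read the table
        self._tables[name] = table
        return table

    def set_retention(self, name, days):
        # Table metadata: how many days of data to keep.
        self._tables[name].retention_days = days

    def load(self, name, rows):
        table = self._tables[name]
        for row in rows:
            if set(row) != set(table.columns):
                raise ValueError(f"row keys {sorted(row)} do not match schema")
            table.rows.append(dict(row))

    def share(self, name, user):
        self._tables[name].readers.add(user)

    def read(self, name, user):
        table = self._tables[name]
        if user not in table.readers:
            raise PermissionError(f"{user!r} cannot read {name!r}")
        return [dict(row) for row in table.rows]

    def apply_retention(self, name, today):
        """Drop rows older than the retention window; assumes a 'ds' date column."""
        table = self._tables[name]
        if table.retention_days is None:
            return 0
        cutoff = today - timedelta(days=table.retention_days)
        kept = [row for row in table.rows if row["ds"] >= cutoff]
        removed = len(table.rows) - len(kept)
        table.rows = kept
        return removed


# Usage: one chain of calls covers create, set metadata, load, and share.
plane = ToyControlPlane()
plane.create_table("db.job_views", ("ds", "title"), owner="alice")
plane.set_retention("db.job_views", days=30)
plane.load("db.job_views", [{"ds": date(2024, 3, 1), "title": "Data Engineer"}])
plane.share("db.job_views", "bob")
```

Because retention and sharing are table metadata rather than separate systems, the maintenance step (apply_retention in this sketch) can run centrally without the table owner managing files directly.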
Tables in LinkedIn’s data lake fall into two categories: self-managed tables and centrally managed tables. Self-managed tables are private to end users but lack consistent management practices. Centrally managed tables, on the other hand, offer public sharing capabilities and table management support. According to LinkedIn, 65% of tables fall under the self-managed category, indicating a need for a more streamlined approach.
While centrally managed tables offer consistency, they require a lengthy onboarding process. OpenHouse overcomes this challenge by eliminating the friction and operational complexity of traditional onboarding. Users can self-serve the creation of centrally managed, shareable tables that comply with the organization’s management practices and policies.
With the open source milestone achieved, LinkedIn now seeks feedback from users to understand how the platform performs in different environments. The company also plans to focus on operationalizing OpenHouse at LinkedIn’s scale and addressing complex technical hurdles as it makes the transition from Hive to OpenHouse.