Yellowbrick Claims Flash Breakthrough with MPP Database
Yellowbrick Data emerged from stealth with a bang today, pairing a new massively parallel processing (MPP) analytic database designed to run exclusively on NVMe flash drives with a claim that it alone has finally cleared a major hurdle to using flash for big data analytics.
Founded by former Fusion-io executives and backed by $44 million in venture funding, Yellowbrick Data claims to have developed technology that allows an MPP database to reap the full I/O benefits of solid state drives (SSDs) – something its CEO says no other data warehousing company has come close to doing.
“People took these database platforms that were running data warehouses and stuck high-performance SSDs on them,” says Yellowbrick Data CEO and co-founder Neil Carson, who was formerly the CTO of Fusion-io. “They were expecting, well flash is denser than hard disk drives. It’s smaller, it’s faster, it’s lighter. Our data warehouses should go 10x to 100x faster.
“But basically it didn’t work and that’s because the core architecture of all of the databases were built around spinning disk at the end of the day,” Carson continues. “Maybe it went a little bit faster, but it didn’t really justify the investment from the economic standpoint.”
Plotting Flash DWs
Carson and his team of engineers from Fusion-io, IBM’s Netezza, Microsoft, Snowflake, and others used flash’s initial analytics failure as the starting point for a new endeavor called Yellowbrick Data.
For four years, the Palo Alto, California company quietly architected and built a new analytic database, as well as a new server designed specifically around flash drives that features almost no RAM and does not cache any data in memory. The new architecture relies heavily on connecting flash drives directly to Intel processors over the PCIe bus, while a Mellanox Ethernet interconnect provides the bandwidth needed for scale-out configurations.
And before it decided to share its story and declare itself open for business, Yellowbrick installed and tested its new flash data warehouse with prominent reference customers, including TEOCO Corporation, Symphony RetailAI, and Overstock.com. “Our goal is basically to go after a giant chunk of this data warehouse market and create a pretty big storm there,” Carson tells Datanami.
Armed with one of the company’s integrated appliances, a customer can run interactive OLAP queries on a petabyte of data, a feat that Yellowbrick claims would require six racks of its competitors’ offerings. The company also says the Yellowbrick Data Warehouse is 30x smaller than a comparably configured scale-out MPP database and that, installed on an equal number of racks, it would run 140x faster. Its performance on benchmarks, such as TPC-DS, has yet to be seen; Carson says the company is focused on serving customer needs.
After one week of testing the Yellowbrick Data Warehouse, TEOCO, which consolidates international billing for telcos like AT&T, decided to go with Yellowbrick. “We’re replacing 24 racks of our prior solution with a half of a rack of Yellowbrick appliances,” the company’s CEO and chairman, Atul Jain, states in a press release.
Yellowbrick is targeting Fortune 500 and Global 2000 firms that may be less than happy with their data warehouses from Teradata, IBM, Oracle, Micro Focus (Vertica), and Amazon (Redshift) – or at least unsatisfied with how much they’re spending to get the required performance.
“The Yellowbrick Data Warehouse offers Overstock a solution to the challenge of fast analytics at a reasonable cost,” says Don Boling, manager of business intelligence at Overstock.com, in a press release. “It is a unique combination of simple deployment with fast and scalable analytic execution.”
Native Flash Query
Yellowbrick Data Warehouse is the first analytic database built and optimized for flash memory from the bottom up, Carson says.
“The key architectural shift we made there is something called Native Flash Query, or NFQ,” he says. “It’s our new technology that allows us to basically run analytic queries against flash just as fast as an in-memory database, like SAP HANA, or even faster.”
Customers can start with a Yellowbrick appliance equipped with 50 TB of storage, by way of Samsung SSDs (Samsung is also an investor). As their data grows, they can scale it out to handle petabytes of data, Carson says.
The Yellowbrick Data Warehouse speaks standard ANSI SQL. It’s based on the PostgreSQL dialect, which means it should be drop-in compatible with IBM Netezza and Amazon Redshift, both of which use the same PostgreSQL-style interface.
Queries that were so large that they needed to run as pre-scheduled batch jobs can now run at interactive speed, Carson says. Users can also continue to use all the same ETL (extract, transform, and load) tools and processes they’ve developed, which provides continuity with prior BI investments.
“You can install it and crank through it at blazing speed to get answers to business questions,” Carson says. “If you’re using Tableau or MicroStrategy, you can just fire it at this system and interactively look at a petabyte of data at rates that nobody has seen before.”
Shared-Nothing MPPs
Despite all the advances that have been made in flash storage over the years, analytic databases have been stuck in a legacy architecture that limits their ability to exploit the tremendously fast I/O of flash. While many storage companies are selling screaming-fast solid state storage arrays with petabytes of capacity and IOPS ratings that go through the roof, MPP databases have been unable to take advantage of them because of their design. For that reason, OLTP databases, as opposed to OLAP databases, have been the big beneficiaries of the flash speedup, Carson says.
“An MPP database is a shared-nothing architecture, which means every node of the system has its own storage,” Carson explains. “So to get high performance out of parallel scale-out database, you can’t have centrally shared storage, like a flash array. You could do one flash array per node, if you have a 30-node cluster, by buying 30 Pure Storage arrays.
“But that leads to two problems,” he continues. “One is you’ve got 30 Pure Storage arrays and the cost of doing that. Secondly, the database is still a legacy database that can’t take advantage of that much I/O. It still expects loads of memory and the data to be cached in memory and all the same issues of databases from the previous century basically, because that’s all you can get.
“Yellowbrick is the only company that’s built the core data processing and execution inside the database from scratch to work with Flash as native flash query architecture,” Carson says. “We’re the only company that has that tech. So you won’t get the economic benefit of that flash unless you have a brand-new database to go with it.”
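The shared-nothing idea Carson describes can be illustrated with a small sketch. This is a generic toy model of scatter-gather aggregation in an MPP system, not Yellowbrick's implementation; the function names (`node_for`, `distribute`, `local_aggregate`, `scatter_gather_sum`) and the hash-based row placement are assumptions made purely for illustration.

```python
from collections import defaultdict
from zlib import crc32


def node_for(key: str, num_nodes: int) -> int:
    """Pick the node that owns a key, using a stable hash (hypothetical placement rule)."""
    return crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Place each (key, amount) row on exactly one node's private storage."""
    nodes = [[] for _ in range(num_nodes)]
    for key, amount in rows:
        nodes[node_for(key, num_nodes)].append((key, amount))
    return nodes


def local_aggregate(node_rows):
    """Work done by a single node, touching only the rows it stores itself."""
    totals = defaultdict(int)
    for key, amount in node_rows:
        totals[key] += amount
    return dict(totals)


def scatter_gather_sum(rows, num_nodes):
    """Run SUM(amount) GROUP BY key: aggregate per node, then merge the partial results."""
    partials = [local_aggregate(node_rows) for node_rows in distribute(rows, num_nodes)]
    merged = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] += subtotal
    return dict(merged)


if __name__ == "__main__":
    sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]
    print(scatter_gather_sum(sales, num_nodes=3))
```

Because no node reads another node's rows, each node's scan speed is bounded by its own local storage, which is why a single centrally shared array does not fit this model.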
Yellowbrick holds several patents on its technologies, including NFQ, which Carson didn’t want to get into very much. Nonetheless, the company is planning on getting the most out of its intellectual property (IP) investment, which means if you’re expecting to get an open source version of NFQ, you could be waiting for a while.
“We have about a 50-person company. We’ve done everything from building our own blade server, which normally takes a business like Cisco or HP to do it, all the way through to building an entire database, which is normally the focus of a whole company as well. We’ve done both,” Carson says.
“It was a phenomenal amount of work,” he continues. “Going into this, if we knew how hard it was, we may not have actually done it. But we’ve done it now.”