Five Ways Big Data Projects Can Go Wrong (And What You Can Do About Them)
So your big data project isn’t panning out the way you wanted? You’re not alone. The poor success rate of big data projects has been a persistent theme over the past 10 years, and the same types of struggles are showing up in AI projects too. While a 100% success rate isn’t a feasible goal, there are some tweaks you can make to get more out of your data investments.
As the world generates more data, it comes to rely more on data too, and companies that don’t embrace data-driven decision making risk falling further behind. Luckily, the sophistication of data collection, storage, management, and analysis has increased hugely over the past 10 years, and studies show that companies with the most advanced data capabilities generate higher revenues than their peers.
Just the same, there are certain patterns of data failures that repeat themselves over and over. Here are five common pitfalls affecting big data projects, and some potential solutions to keep your big data project on the up and up.
Putting It All in the Data Lake
More than two-thirds of companies say they’re not getting “lasting value” out of their data investments, according to a study cited by Gerrit Kazmaier, the vice president and general manager for database, data analytics, and Looker at Google Cloud, during the recent launch of BigLake.
“That’s profoundly interesting,” Kazmaier said during a press conference last month. “Everyone recognizes that they’re going to compete with data…And on the other side we recognize that only a few companies are actually successful with it. So the question is, what is getting in the way of these companies to transform?”
One of the big reasons is a lack of data centralization, which inhibits a company's ability to get value out of its data. Most companies of any size have data spread across a large number of silos: databases, file systems, applications, and other locations. Companies responded to that dilemma by putting as much data as possible into data lakes, such as Hadoop clusters or (more recently) object storage systems running in the cloud. In addition to providing a central place for data to reside, the data lake lowered the cost of storing petabytes of data.
However, while it addressed one problem, the data lake introduced a whole new set of problems of its own, particularly when it comes to ensuring the consistency, purity, and manageability of that data, Kazmaier said. “All of these organizations who tried to innovate on top of the data lake, but found it to be at the end of the day just a data swamp,” he said.
Google Cloud’s latest solution to this dilemma is the lakehouse architecture, as manifested in BigLake, a recently announced offering that melds the openness of the data lake approach with the manageability, governance, and quality of a data warehouse.
Companies can keep their data in Google Cloud Storage, an S3-compatible object storage service that supports open data formats like Parquet and Iceberg and can be queried with engines like Presto, Trino, and BigQuery, all without sacrificing the governance of a data warehouse.
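To make the pattern concrete, here is a minimal sketch in Python, using the google-cloud-bigquery client, of how open-format files in object storage can be defined and queried as a governed table. The project, dataset, bucket, and connection names are placeholders, and the sketch assumes a BigQuery dataset and a Cloud resource connection already exist; treat it as an illustration of the lakehouse idea rather than a recipe from Google.

```python
# A minimal sketch of the lakehouse pattern: Parquet files stay in object
# storage, and BigQuery queries them through an external table definition.
# All names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Define an external (BigLake-style) table over Parquet files in a GCS bucket.
client.query(
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.analytics.events`
    WITH CONNECTION `my-project.us.lake-connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-data-lake/events/*.parquet']
    )
    """
).result()

# Analysts can now query the lake with ordinary SQL, while access is governed
# at the table level rather than by handing out raw bucket permissions.
rows = client.query(
    "SELECT event_type, COUNT(*) AS n "
    "FROM `my-project.analytics.events` GROUP BY event_type"
).result()

for row in rows:
    print(row.event_type, row.n)
```

The data never leaves the storage bucket; only the table definition and access policy live in the warehouse layer, which is what lets the same files serve multiple engines.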
The lakehouse architecture is one way companies are trying to overcome the natural divisions that arise among disparate data sets. But the world of data is extremely diverse, so it’s not the only one.
No Centralized View Into Data
After struggling to centralize data in data lakes over the past decade, many companies have resigned themselves to the fact that data silos will be with us for the foreseeable future. The goal, then, becomes taking down as many of the barriers impeding user access to data as possible.
At Capital One, the big data goal has been to democratize user access as part of an overall modernization of the data ecosystem. “It’s really more about making data available to all of our users, whether they be analysts, whether they be engineers, whether they be machine learning data scientists etc. to just unlock the potential of what they can do with data,” said Biba Helou, SVP Enterprise Data Platforms and Risk Management Technologies at the credit card company.
A key element of Capital One’s data democratization effort is a centralized data catalog that provides a view into a variety of data assets, while simultaneously keeping track of access rights and governance.
“It’s making sure that we’re doing that obviously in a way that is well managed, but making sure that people just have the ability to see what’s out there, and to get access to what they need to be able to innovate and create great products for our customers,” Helou told Datanami in a recent interview.
The company decided to build its own data catalog. One of the reasons for that was that the catalog also allows users to create data pipelines. “So it’s a catalog, plus. It’s very interconnected to all of our other systems,” she said. “Rather than getting a lot of third-party products and stringing them together ourselves, we found it a lot easier to build an integrated solution for ourselves.”
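Capital One hasn’t published the internals of its catalog, but a rough sketch can show what a centralized catalog entry typically has to track to serve both discovery and governance. The classes and fields below are hypothetical illustrations of the general idea, not Capital One’s design.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatasetEntry:
    """One asset in a hypothetical centralized data catalog."""
    name: str                          # e.g. "transactions_daily"
    owner: str                         # accountable team or data steward
    location: str                      # where the data physically lives
    schema_version: str                # so consumers can detect changes
    allowed_roles: List[str] = field(default_factory=list)    # access rights
    upstream_sources: List[str] = field(default_factory=list) # pipeline lineage

class Catalog:
    """Central registry: users discover assets, governance checks access."""
    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[DatasetEntry]:
        # Lets analysts and data scientists see what exists before requesting access.
        return [e for e in self._entries.values() if keyword in e.name]

    def can_access(self, name: str, role: str) -> bool:
        # Governance check: is this role allowed to read this asset?
        entry = self._entries.get(name)
        return entry is not None and role in entry.allowed_roles
```

The same metadata, once centralized, can also feed pipeline tooling, which is roughly the “catalog, plus” integration Helou describes.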
Going Too Big Too Fast
During the heyday of the Hadoop era, many companies spent great sums to build large clusters to power their data lakes. Many of these on-prem systems were more cost-efficient than the data warehouses they replaced, at least on a per-terabyte basis, thanks to the use of standard x86 processors and hard disks. However, these large systems brought added complexity that drove up their total cost.
Now that we’re firmly in the cloud era, we can look back on those investments and see where we went wrong. Thanks to the availability of cloud-based data warehousing and data lake offerings, customers can start with a small investment and move up from there, said Jennifer Belissent, a former Forrester analyst who joined Snowflake last year as its principal data strategist.
“I think that’s one of the challenges that we’ve had is people have approached it as, we need to upfront do a big investment,” Belissent said. “You get disillusionment. Whereas it doesn’t need to be that way, particularly if you’re leveraging cloud infrastructure. You can start with a single project, populate part of your data lake or data warehouse, deliver results, and then incrementally add more use cases, add more data, add more results.”
Instead of going for broke right off the bat with a risky big-bang project, Belissent said, customers are better off starting with a smaller project that has a higher likelihood of success, and then building on that over time.
“Historically the industry in general, when talking about big data and expecting people to embrace big data, by definition [means] big infrastructure, and that has set people back,” she said. “Whereas if you think to start small, build incrementally, and leverage cloud infrastructure, which is easier to use and you don’t have to have that upfront capital outlay to put it in place, then you’re able to show the results and you’re perhaps eliminating some of that disillusionment that we’ve seen over previous generations.”
Belissent pointed out that Gartner has recently started emphasizing the advantages of “small and wide data.” It’s a point that Andrew Ng has been making on the speaking circuit when it comes to AI projects.
“It’s not just about big data, it’s about right-sizing your data,” Belissent told Datanami in an interview last week. “It doesn’t have to be enormous. We can start small and scale up, or we can diversify our data sources and go wide, and that allows us to enrich data that we have about our customers and get a better picture of what they need and what they want, and be more contextual about the way we serve them.”
Even though a big data project doesn’t need to be massive out of the gate, you should still be thinking about the possibility of expansion down the road.
Not Planning Ahead for Big Growth
One of the recurring themes in big data is the unpredictability of how users will embrace new solutions. How many times have you read about a big data project that was pegged as a sure bet turning out to be a massive failure? At the same time, many side projects with little expectation of success turn out to be huge winners.
It’s generally wise to start small with big data, and build upon success down the line. However, when choosing your big data architecture, you want to be careful not to hamstring yourself by selecting a technology that becomes an impediment to scale down the line.
“Whether it’s a service and infrastructure business, AI, or whatever–if it’s successful, it’s going to expand incredibly fast,” said Lenley Hensarling, chief strategy officer for NoSQL database company Aerospike. “It’s going to become big. You’re going to be using big data sets. You’re going to have super high throughput in terms of the number of operations going on.”
The folks at Aerospike call it “aspirational scale,” and it’s a phenomenon that’s generally more prevalent among Internet companies. Thanks to the cloud eliminating the need for hardware investments, companies can ramp up the computational horsepower to the nth degree.
However, unless your database or file system can also scale and handle the throughput, you won’t be able to take advantage of the performance on the public cloud. While modern NoSQL databases are easily adaptable to changing businesses, there are limits to what they can deliver. And database migrations are never easy.
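As a rough illustration of that planning problem, the back-of-envelope check below uses entirely hypothetical numbers for users, per-user operations, and per-node database throughput. The point is simply that an “aspirational scale” scenario should be tested against what the data layer can actually sustain before the growth arrives.

```python
# Back-of-envelope throughput check with hypothetical numbers: can the data
# layer keep up if the workload grows 10x?
import math

current_users = 200_000
ops_per_user_per_sec = 0.05      # assumed reads + writes per active user
growth_factor = 10               # the "aspirational scale" scenario

node_capacity_ops = 20_000       # assumed sustained ops/sec per database node
current_nodes = 3

required_ops = current_users * ops_per_user_per_sec * growth_factor
available_ops = node_capacity_ops * current_nodes

print(f"Required after growth: {required_ops:,.0f} ops/sec")
print(f"Current cluster limit: {available_ops:,.0f} ops/sec")

if required_ops > available_ops:
    extra = math.ceil(required_ops / node_capacity_ops) - current_nodes
    print(f"Roughly {extra} more nodes needed, assuming near-linear scaling.")
else:
    print("The existing cluster should absorb the growth.")
```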
There are lots of known failure modes in big data, and undoubtedly some unknown ones too. It’s important to familiarize yourself with the common ones. But perhaps most importantly, it’s good to know that failure is not only expected, but should be welcomed as part of the process.
Not Being Resilient to Failure
When using big data insights to modify business strategies, unknown factors can appear out of nowhere, rendering an experiment a failure, or even a surprise success. Keeping one’s wits during this fraught process is a key differentiator between long-term success and short-term big data failure.
Science is inherently a speculative thing, and you should embrace that, according to Satyen Sangani, the CEO and co-founder of data catalog company Alation. “We hypothesize and sometimes the hypotheses are right and sometimes they’re wrong,” he said. “And sometimes we’re going to experiment and sometimes we can predict it and sometimes we can’t.”
Sangani encourages companies to have an “exploratory mindset” and to think a bit like a venture capitalist. On the one hand, you can get a low but reliable return by making conservative investments in, say, hiring a new salesperson or expanding headquarters. Alternatively, you can take a more speculative approach that is less likely to succeed but could pay off in a spectacular manner.
“That sort of exploratory mindset is hard for people to get themselves around,” Sangani said. “If you’re going to invest in a portfolio of data assets and AI investments, you’re probably not going to get a 100% return on your investment for every single individual investment, but it may be that one of the investments is a 10X investment.”
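A toy calculation makes the portfolio logic concrete. The success probabilities and payoff multiples below are invented purely for illustration; they are not Alation’s numbers or anyone else’s.

```python
# A hypothetical portfolio of data and AI investments. Most individual bets
# are long shots, yet the portfolio as a whole can still come out ahead.
investments = [
    # (probability of success, multiple of invested capital if it succeeds)
    (0.8, 1.2),   # safe, incremental analytics improvement
    (0.5, 2.0),   # moderate-risk data product
    (0.3, 4.0),   # speculative ML use case
    (0.1, 10.0),  # the 10X long shot Sangani describes
]

expected_return = sum(p * payoff for p, payoff in investments) / len(investments)
print(f"Expected return per unit invested: {expected_return:.2f}x")
# With these made-up numbers the portfolio roughly breaks even on average,
# even though only one of the four bets is more likely than not to succeed.
```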
At the end of the day, companies are gambling that they’ll hit one of those 10x payoffs from their data investments. Of course, hitting data gold requires doing lots of little things right. There are lots of things that can go wrong, but through trial and error, you can learn what works and what doesn’t. And hopefully when you do hit that 10x payoff, you’ll share those learnings with the rest of us.
Related Items:
Why So Few Are Mastering the Data Economy
The Modernization of Data Engineering at Capital One
Google Cloud Opens Door to the Lakehouse with BigLake