4 Best Practices for Data Lakes
Updated · Jul 05, 2016
Data lakes sound simple: Pool data or information into a Big Data system that combines processing speed with storage — a Hadoop cluster or an in-memory solution — so the business can access it for new insight. As with so much in technology, though, the reality is much more challenging than the dream.
Part of that is a misunderstanding of what a data lake should be, said the man who coined the term, Pentaho founder and CTO James Dixon. He never intended data lakes to describe a huge Hadoop repository that pulled data from all enterprise applications.
What Is a Data Lake?
“When people ask what a data lake is, I tell them it's what you used to have on tape. Take what you have on tape and pour it into a data lake and start exploring that data,” Dixon said. “Our story was always only put into Hadoop what you need to; if you want to combine information from the data lake with information in your CRM system, well just do a join, do that blending of data only when you need to.”
Despite Dixon's intentions, the term took on a broader meaning with bigger promises. Folks began viewing Big Data lakes as a way to solve integration headaches by bringing all data into one super-fast, easy-to-access repository.
Instead, the repositories turned into slow and unyielding data swamps. Big Data required special expertise to analyze. The conclusions that resulted from using raw data raised red flags about data quality and governance.
“Everybody wanted to look at a data lake as the silver bullet for IT. Has there ever been one? I'm still waiting,” said Nick Heudecker, who researches data management for Gartner's IT Leaders (ITL) Data and Analytics group. “I think once you get beyond that discovery phase, you need to do more. Data lakes, that same infrastructure can help, but you need to go into more of a professional information management world once you used that data to answer the questions that you generated.”
So given the reality of data lakes, how can you use them to your organization's advantage? Experts say there are four key data lake best practices:
- Understand data lake use cases
- Do not forget existing data management best practices, such as establishing strong data governance
- Know the business case for your data lake, as it will determine the appropriate architecture
- Pay attention to metadata
Understand Data Lake Use Cases
To build a successful data lake, enterprises need to throw out the idea that a data lake will let them collect all their data in one place. It's also important to understand that data lakes are not a replacement for enterprise data management systems and practices, at least not given the current state of Big Data technology.
“Organizations are still talking about data lakes but they're also recognizing that all lakes are not equal,” said Jack Norris, senior VP of Data and Applications with MapR. “There's a certain amount of capabilities you need or we've heard people talk about data swamps, where it's hard to get data to flow out or in, it's just stagnating there.”
Given that the data lake didn't work out as planned, is it still viable? Yes, provided you understand its limits, experts say.
“I have a pretty scoped view – I don't want to say narrow – but a very scoped view of what a data lake is,” Heudecker said. “To me, it's a data science sandbox. It's where you play with data and you try to find new insights. Once you've found that new insight, does it make sense to leave data in its raw format? I would argue that it doesn't because you now need to optimize the data. You need to ensure that it's governed, that it's semantically consistent, that it will meet the needs of the business consumers so to me the data lake is a lab. And you can do other things with it but for me, when I'm advising clients that's how I try to advise them to think about their data lake.”
That isn't as limiting as it may sound. For instance, Heudecker notes enterprises use data lakes to extract insight from Internet of Things deployments. Philip Russom, research director for data management with TDWI Research, said data lakes can serve multiple purposes, such as providing more flexibility for agile data warehousing and reporting. Data lakes also often serve as a data landing or staging area for Hadoop clusters and data integration.
“In its extreme state, a data lake ingests data in its raw, original state, straight out from data sources, without any cleansing, standardization, remodeling, alteration, etc.,” Russom said by email. “The point of working with raw, unaltered detailed source data, is that the data can be altered on the fly during run time, as new and unique requirements for analytics arise. This assumes that once you change data for a specific purpose, the output data is somewhat limited for other purposes.”
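The idea Russom describes, keeping source data untouched and shaping it only when an analysis needs it, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any vendor's product: the `RAW_LAKE` records, the field names and the helper functions are all hypothetical.

```python
import json

# Hypothetical raw sensor readings, stored exactly as they arrived
# (values are still strings, and one reading is unusable).
RAW_LAKE = [
    '{"meter": "A1", "volts": "239.8", "ts": "2016-07-01T00:00:00"}',
    '{"meter": "A1", "volts": "240.4", "ts": "2016-07-01T00:15:00"}',
    '{"meter": "B7", "volts": "n/a", "ts": "2016-07-01T00:00:00"}',
]


def read_voltages(raw_lines):
    """Parse and clean at read time; the stored lines are never modified."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            volts = float(record["volts"])
        except ValueError:
            continue  # skip unparseable readings for this analysis only
        yield record["meter"], volts


def average_by_meter(raw_lines):
    """Compute the mean voltage per meter from the raw lines."""
    totals = {}
    for meter, volts in read_voltages(raw_lines):
        count, total = totals.get(meter, (0, 0.0))
        totals[meter] = (count + 1, total + volts)
    return {meter: total / count for meter, (count, total) in totals.items()}


averages = average_by_meter(RAW_LAKE)
```

Because the cleaning lives in the reader rather than in the stored data, a later analysis with different requirements (for example, one that wants to count the unusable readings) can apply its own rules to the same raw lines.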
Apply Existing Data Management Best Practices
It is possible to move beyond these simpler use cases, Russom added, but it requires more than dumping data into a data lake.
“There are now users who've been using some form of data lake for years (even on newish Hadoop), and we can learn from their successful maturation. Users have learned that they get more use from a lake (i.e, biz value) when they discretely impose some form of structure onto parts of the lake (rarely the whole thing),” he wrote.
This also means that organizations cannot ignore the hard-learned data lessons of the past 20-30 years when analyzing data lake stores or integrating with enterprise applications. Audit trails, data integrity, data stewardship, governance and ownership all still apply.
Know the Business Case for Your Data Lake
Technologists love to say that IT projects should start with the business, but in this case, it's a critical first step to determining how to build a data lake. The business case doesn't just influence the architecture: It determines it.
For instance, Dixon notes that when his company interviewed early adopters of Hadoop clusters, 80-90 percent of the use cases involved structured data rather than unstructured data. Knowing what your business use case will be and what type of data it requires is key to deciding whether your data lake can be built in a traditional relational database, a Hadoop cluster or a NoSQL alternative. For example, relational databases work well for IoT sensor data, according to Heudecker, which means you can save the cost of hiring for NoSQL skills.
The business case will also drive whether you need to use some form of SQL support on any NoSQL solution. If the data will be moved to an enterprise analytics tool, then you'll want to consider how to support data best practices.
“It never was about just the data,” Norris said. “It was always about what are you going to do. What are the use cases, what are the applications you can bring to bear on that data to benefit from it.”
Support Metadata
Finally, pay attention to the metadata. Metadata repeatedly emerged as the key to ensuring data lakes are a viable strategy rather than a data graveyard. The good news here is that Big Data and analytics vendors are introducing new tools that support adding metadata to data lakes and other Big Data stores. For instance, metadata injection was a key component in the release of Pentaho Business Analytics 6.1.
“We're getting to the point that people realize Big Data does bring something that you can't do with other data stores,” Dixon said. “Now it needs to behave like other enterprise-grade applications. Now it needs security, it needs monitoring and logging and auditing, and it needs metadata to make it more robust, more usable, more user-friendly. I think it's the outcome of it becoming a more standard tool in enterprise IT.”
Heudecker said metadata is also key to a new trend Gartner sees: making “connections not collections” with data. It's cheaper, easier and more efficient to leave data where it is than to move it to ever-growing clusters or data warehouses.
“The biggest challenge, the biggest thing that enterprises should be concerning themselves with is metadata and metadata management,” he said. “If you have a very good idea of your data's metadata, you can fix a lot of things that maybe you delayed or deferred on when you were busy making things work. So as long as you've got good metadata, you can fix governance. And you can fix security. You can fix any data quality issues.
“As long as you focus on that, that's something you can build a foundation of and then build upon as your requirements evolve and as your understanding of the use cases becomes more definite.”
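To make the point about metadata concrete, here is a small Python sketch of a metadata catalog and two checks that could be run against it. It is an illustration under assumptions only: the `DatasetMetadata` fields, the sample entries and the two helper functions are hypothetical and do not represent any tool named in this article.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry describing one dataset in the lake."""
    name: str
    source: str
    owner: str = ""
    contains_personal_data: bool = False
    columns: dict = field(default_factory=dict)


def governance_gaps(catalog):
    """Return names of datasets with no owner or no documented columns."""
    return [m.name for m in catalog if not m.owner or not m.columns]


def restricted_datasets(catalog):
    """Return names of datasets flagged as holding personal data."""
    return [m.name for m in catalog if m.contains_personal_data]


# Sample entries, invented for illustration.
CATALOG = [
    DatasetMetadata(
        name="smart_meter_raw",
        source="meter gateway",
        owner="grid-analytics",
        columns={"meter": "string", "volts": "float"},
    ),
    DatasetMetadata(
        name="crm_export",
        source="CRM system",
        contains_personal_data=True,
        columns={"customer_id": "string"},
    ),
    DatasetMetadata(name="clickstream_dump", source="web logs", owner="marketing"),
]

gaps = governance_gaps(CATALOG)
restricted = restricted_datasets(CATALOG)
```

With even this much recorded, the governance and security work deferred earlier becomes a query over the catalog instead of an investigation of each dataset.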
More Information on Data Lakes
Data governance is an important aspect of data lakes, experts agree. For more information on data governance, see 5 Steps to Boosting Big Data's Value through Data Governance.
Canadian utility BC Hydro uses an EMC data lake for analyzing data collected from smart meters, then feeds it into systems which can detect discrepancies in voltage patterns. This has helped the utility reduce theft of electricity by 75 percent. It is one of four data analytics success stories Enterprise Apps Today wrote about last year.
Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.