Data Lakehouse – Changing the Analytics Landscape

Data is the new currency of the modern world. Companies that have learned to amass data and leverage its power have reached significant valuations on Wall Street, with Google and Facebook prominent among those leading the charge.

Our civilization is producing data at an unprecedented rate, and this growth in data generation will not lessen in the near future. With massive data comes the need for efficient storage. Fortunately, data warehouses have been around for a long time to help with the situation. Simply put, a data warehouse is a store of structured information, well organized into tables that have linkages to other tables, or even to other data warehouses, so that data can be joined and information can be mined.
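
To make that structure concrete, here is a minimal sketch in Python using SQLite: two linked tables and a query that mines information across the linkage. The schema, names and figures are purely illustrative, not drawn from any particular warehouse product.

```python
import sqlite3

# In-memory database standing in for a warehouse (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- the linkage
        amount      REAL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 75.5), (12, 2, 300.0)])

# Mining information across linked tables: total spend per customer.
for row in conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM customers c JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.name"""):
    print(row)
```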

The rise of more unstructured data, such as contracts, written documents, pictures, etc., has led to the adoption of big data platforms. A data lake offers an elegant solution to housing big data. A data lake not only has the ability to store massive amounts of data but also allows for efficient retrieval of both structured and unstructured data. The abundance of data has created a desire to extract more and more insights, with machine learning and artificial intelligence applications being increasingly deployed to mine the knowledge hidden in the data. With the increasing complexity of data types, and even more complex algorithms being deployed to examine them, specialized analytical environments have been established and dedicated to these workloads. Data teams consequently stitch these systems together to enable business intelligence and machine learning across the data, resulting in duplicate data, extra infrastructure cost and security challenges. These create redundancy, in addition to latency gaps between when data is created, stored, transferred, retrieved, analyzed and finally handed to decision makers. A data lake therefore has its own set of limitations.

Enter the data lakehouse (or simply lakehouse), a relatively new term in the world of data science and engineering. As the name suggests, a lakehouse combines a data lake and a data warehouse: it pairs the data lake's ability to store massive sets of both structured and unstructured information with the data warehouse's support for real-time writing and reading of that information.
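
As a minimal sketch of that combination, the example below writes a small structured dataset to a Delta Lake table (one of the lakehouse-enabling tools listed later in this article) and immediately reads it back for analysis. It assumes the open-source `deltalake` and `pandas` Python packages are installed; the path and column names are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Structured records land directly in open-format lake storage
# (Parquet files plus a transaction log) rather than a separate warehouse.
df = pd.DataFrame({"sensor_id": [1, 2], "reading": [98.6, 99.1]})
write_deltalake("/tmp/lakehouse/readings", df, mode="append")  # hypothetical path

# The same table is immediately readable for analysis: no export/import hop.
table = DeltaTable("/tmp/lakehouse/readings")
print(table.to_pandas())
```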

A lakehouse is enabled by a new system design that allows multiple systems (both data lakes and data warehouses) to converge onto a single platform serving multiple use cases, from business intelligence and business analytics to data science, machine learning and artificial intelligence projects. The advantages of the lakehouse environment include:

  1. Elimination of simple extract, transform and load (ETL) jobs (illustrated in the sketch after this list)
  2. Reduced data redundancy
  3. Ease of data governance
  4. Direct connection to business intelligence tools
  5. Cost reduction
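
As one illustration of points 1 and 4, the sketch below uses DuckDB, our own choice for the example rather than a tool endorsed in this article, to run BI-style SQL directly against raw Parquet files already sitting in the lake, with no intermediate load job. The file path and column names are hypothetical.

```python
import duckdb

# Query raw lake files in place: no ETL job copies them into a warehouse first.
# 'sales/*.parquet' is a hypothetical path to files already in the lake.
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('sales/*.parquet')
    GROUP BY region
    ORDER BY total_sales DESC
""").df()

print(result)  # a BI tool could consume the same SQL over a standard connection
```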

There are now a variety of tools available to support the lakehouse architecture, including:

  1. Google BigQuery
  2. Apache Drill
  3. Amazon Athena
  4. Delta Lake
  5. Azure Synapse

In a previous article, we discussed the Lambda data architecture and its ability to provide timely insights. A lakehouse combined with the Lambda architecture is an increasingly popular choice in the big data world today. Data science and engineering teams are rapidly adopting this combination to meet the ever-growing demand to process data faster and at higher velocity.
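
To make the pairing concrete, here is a hedged sketch of the essential move: a batch layer and a speed layer both appending to one Delta table, so consumers query a single, continuously updated store. In a real deployment the speed layer would be a streaming job; here both writes are plain appends, and every name and path is hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

TABLE = "/tmp/lakehouse/events"  # hypothetical table shared by both layers

# Batch layer: a periodic bulk load of historical records.
history = pd.DataFrame({"event_id": [1, 2], "value": [10.0, 12.5]})
write_deltalake(TABLE, history, mode="append")

# Speed layer: fresh records appended as they arrive
# (in practice, a streaming job rather than a one-off append).
fresh = pd.DataFrame({"event_id": [3], "value": [11.2]})
write_deltalake(TABLE, fresh, mode="append")

# Serving: one query sees both layers, narrowing the latency gap described above.
print(DeltaTable(TABLE).to_pandas().sort_values("event_id"))
```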

The concept of the lakehouse is still at an early stage, so there are some limitations to consider before depending entirely upon the data lakehouse architecture, such as query compatibility, data cleaning complexity, etc. However, data engineers can contribute to solving these limitations by leveraging open-source tools and making new innovations available to the larger audience. Larger companies like Facebook and Amazon have already laid the groundwork for the lakehouse and open-sourced the tools they use. In addition, amid a global crisis like the COVID-19 pandemic, the medical research community can also benefit from having both real-time and historical data made widely available, in the best form and in the most efficient way possible, with reduced latency.

For more information about data lakehouse architecture and capabilities, reach out to us at analytics@dhg.com.

CONTRIBUTORS

Amit Arya
Chief Data Officer
amit.arya@dhg.com

© Dixon Hughes Goodman LLP. All rights reserved.
DHG is registered in the U.S. Patent and Trademark Office to Dixon Hughes Goodman LLP.