Modern Data platforms consider using a data lake over a data warehouse for their storage requirements. However, a data lake lacks in providing support for business intelligence use cases and often organizations end up having a separate data warehouse for their business intelligence use case and operational users. This article explains the different paradigms organizations follow and how Lakehouse solves the problems of both the worlds.
Let’s first start with the difference between a Data lake and a Data warehouse.
Data Lake v/s Data Warehouse
Let me explain the difference using a simple analogy:
Think of a data warehouse as a store of packaged drinking water and data lake as a body of natural drinking water.
The packaged drinking water is directly consumable because the water has been cleaned/filtered and packaged. But in order to consume water from a lake, you have to process/filter it before you can use it. Also, only users who know how to open a bottle will be able to consume water from the bottle. But in the case of the lake even animals can consume water from it. Water in the lake is in its rawest form whereas water in the packaged bottles is processed.
Water in the lake can be used for a variety of different purposes but nobody uses bottled water for cleaning the car and irrigation etc. Bottled water is used to drink most of the time.
The above points clarify a few things:
- Drinking water from bottled water requires some expertise or in other words consuming data from a data warehouse requires some expertise and is meant for a few users, implying that the user base of a data warehouse is limited. In the case of a lake, anybody can consume water from it after doing their own filtration. Similarly, data lake has a vast user base and is not limited to a few users.
- The usage of bottled water is pre-decided (most of the time it will be used for drinking) but the usage of lake water depends on its user. This implies that the purpose of lake water is decided as and when somebody wants to consume water from it whereas a lot of pre-processing has already been done to design the data warehouse. This ties to the distinction between schema on write vs schema on read approach followed by data warehouse and data lake respectively.
- The bottled water is preprocessed but water in a lake requires some cleaning before it becomes consumable and the onus is on individual users to clean it.
- Water in the bottle is confined to some structure whereas water in the lake is in different formats and might be coming from a variety of streams as well. This implies that data lake supports all data types whereas data warehouse only supports structured data.
Now you might be thinking with all the points mentioned above that data lake is the way to go as it supports all types of data: structured, semistructured, and unstructured along with a vast majority of users whereas Data warehouses only support structured information and have a small user base.
While data lake is good for data science and machine learning use cases as it provides all sorts of data for deep analysis, it misses out on the operational use case cases where the data should be well structured to get the key metrics and reports.
And what if you want to derive some business intelligence from the data in your data lake? You probably will load the data from the data lake into any data warehouse for the business intelligence use case. And a lot of organizations do that, but in that case, they end up managing two different siloed storage systems: one for BI use cases and another for use cases such as Data Science, Machine Learning, etc.
Data Lake and Data Warehouse
New modern organizations have data management platforms which combine both data lake and data warehouse in a single entity by performing ETL/ELT on the raw data of data lake and define some structures on it.
So far all good no need to maintain two separate storage entities and now your data warehouse sits inside your data lake which acts as a landing place to ingest raw data. A data warehouse is the portion of the data lake where we enrich and structure the data so that it can be queried directly.
This paradigm copes up with the challenges of having two separate storage systems but we still need to maintain the ETL/ELT layer.
Data Lake + Data Warehouse = Data Lakehouse
But wait there is another paradigm known as Data Lakehouse that is trending nowadays and is bringing the best of both paradigms (Data Lake and Data Warehouse). In simple words, Data Lakehouse enables you to perform queries for BI/reports use cases directly on the data stored in the data lake without having to load it in a separate data warehouse. The processing/structural layer is responsible for making that happen.
This approach saves a lot of operational costs as you don’t have to manage two storage systems simultaneously. It also eliminates the ETL/ELT layer which transforms data from the data lake and loads in the data warehouse as you can now directly have your query engine querying data directly from the lake.
This paradigm helps organizations to move faster and make better decisions.
The processing or structured layer could be AWS Athena, Delta lake, etc. I personally like Delta lake for this kind of functionality as it provides ACID transactions on top of your data lake which enables you to perform some business-critical actions like GDPR DSR requests etc.
Final thoughts: Deciding on which paradigm to use should be a conscious decision and it mostly depends on your use case and how your organization operates.