Introduction To The Concept Of Data Lake And Its Benefits

Big data does not generate value by itself. Value is generated when we turn data into insights that produce tangible results for the business.

However, big data projects are not simple undertakings. Many technologies are available, but integrating a very diverse collection of structured and unstructured data is not trivial. The complexity of the work grows with the variety and volume of data that must be accessed and analyzed.

One possible answer to this challenge is the data lake: a repository that stores a large and varied amount of structured and unstructured data. It is a massive, easily accessible repository for “big data”, built on (relatively) inexpensive computer hardware. Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, the data lake is designed to retain all attributes. This is especially useful when you do not yet know what the scope of the data or its use will be.
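The difference between a data mart and a data lake can be illustrated with a minimal sketch. The records, field names, and the `build_mart` helper below are hypothetical, chosen only for illustration: the mart keeps one aggregate for one planned report, while the raw records retain every attribute and can still answer a question nobody planned for.

```python
from collections import defaultdict

# Hypothetical raw order events, kept with every attribute (the "lake" view).
raw_orders = [
    {"region": "north", "product": "A", "amount": 10.0, "device": "mobile"},
    {"region": "north", "product": "B", "amount": 4.0, "device": "desktop"},
    {"region": "south", "product": "A", "amount": 6.0, "device": "mobile"},
]


def build_mart(orders):
    """Keep only the total amount per region, as a mart built for one report might."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["amount"]
    return dict(totals)


# The mart answers the planned question (revenue by region) and nothing else.
mart = build_mart(raw_orders)

# An unplanned question (revenue by device) needs the "device" attribute,
# which the mart dropped but the raw records still carry.
by_device = defaultdict(float)
for order in raw_orders:
    by_device[order["device"]] += order["amount"]
by_device = dict(by_device)
```

The mart is smaller and faster to query, but the `device` attribute cannot be recovered from it. That is the trade-off behind retaining all attributes in the lake.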

The terminology is new, so there is no consensus on the name. Some call it a data hub. We adopt “data lake”, which is the most widely used term.

With a data lake, different kinds of data are accessed and stored in their original form. There we can look for correlations and insights directly, and we can also generate the traditional data warehouse (DW) from it to handle structured data. Data lake data models (or schemas) are not defined up front, but emerge as we work with the data itself. Recall that in a relational DW, the data model or schema must be defined in advance. In a data lake, the concept is one of “late binding” or “schema on read”, where the schema is built at query time.

This comes at a good time, because the traditional data warehouse model has existed for some 30 years, almost unchanged. It has typically been based on third normal form modeling, which implies a single view of the truth. It worked and still works well in many cases. However, with big data, with increasing volumes and varieties of (often unstructured) data, and with the need to be flexible enough to answer unplanned questions, the DW model clearly shows its limitations. It was not designed for today’s world.
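The “late binding” idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product implements it: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. The raw lines are stored as they arrived, and each query supplies its own schema when it reads them.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
# Note that the records do not all share the same fields.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "channel": "web"}',
    '{"user": "bo", "amount": "5.00"}',
    '{"user": "ana", "amount": "7.10", "channel": "app", "coupon": "X1"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a record are returned as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two different questions, two different schemas over the same raw data.
sales_schema = {"user": str, "amount": float}
channel_schema = {"user": str, "channel": str}

sales = list(read_with_schema(RAW_LINES, sales_schema))
channels = list(read_with_schema(RAW_LINES, channel_schema))

# Total spend per user, computed from the "sales" view.
total_by_user = {}
for row in sales:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

Nothing about the stored lines had to change to support the second schema. In a schema-on-write system, the `channel` and `coupon` fields would have needed a place in the model before the data was loaded.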

For simplicity, a data lake can be imagined as a huge grid with billions of rows and columns. But unlike a structured spreadsheet, each cell of the grid may contain a different kind of data. One cell can contain a document, another a photograph, and another a paragraph or a single word of text. Yet another contains a tweet or a Facebook post. It does not matter where the data came from: it is simply stored in a cell. In other words, a data lake is an unstructured data warehouse where data from multiple sources is stored.

An innovative aspect of the concept is that, because models do not have to be defined in advance, much of the time spent on data preparation in the current data warehouse or data mart model is eliminated. By some estimates, we spend on average about 80% of our time preparing data and only 20% analyzing it. If we significantly reduce the preparation time, we can focus on the analysis, which is what actually creates value. Because the data is stored in its original form, without prior formatting, it can be analyzed in different contexts and is no longer limited to a single data model. In practice, this is the model that companies like Google, Bing and Yahoo use to store and search huge and varied amounts of data. And before you ask, the technology that supports the data lake concept is Hadoop. The data lake architecture is simple: an HDFS (Hadoop Distributed File System) with a lot of directories and files.
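The “directories and files” layout can be sketched as follows. This is only an illustration: it uses a temporary local directory as a stand-in for HDFS, and the `raw/<source>/<day>/` convention, the file names, the dates, and the `land_raw` and `list_files` helpers are assumptions made for the example, not a layout prescribed by Hadoop.

```python
import tempfile
from pathlib import Path


def land_raw(root, source, day, name, payload):
    """Write payload bytes untouched under <root>/raw/<source>/<day>/<name>."""
    target_dir = Path(root) / "raw" / source / day
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / name
    path.write_bytes(payload)
    return path


def list_files(root, source=None):
    """List stored files (relative to root), optionally for a single source."""
    base = Path(root) / "raw"
    if source is not None:
        base = base / source
    return sorted(
        p.relative_to(root).as_posix() for p in base.rglob("*") if p.is_file()
    )


# A local temporary directory stands in for the HDFS root (an assumption).
lake_root = tempfile.mkdtemp()

# Very different kinds of data are stored side by side, each in its original form.
land_raw(lake_root, "tweets", "2024-01-01", "batch1.json", b'{"text": "hello"}')
photo = land_raw(lake_root, "photos", "2024-01-01", "cat.jpg", b"\xff\xd8\xff")
land_raw(lake_root, "crm", "2024-01-02", "customers.csv", b"id,name\n1,Ana\n")

all_files = list_files(lake_root)
tweet_files = list_files(lake_root, "tweets")
```

No model is consulted when a file is written. Structure is imposed only later, by whatever reads the files.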


Manohar Parakh
