The Data Lake and the Compelling Need to Catalogue It!
With reducing TCO (Total Cost of Ownership) as a primary driver, every other organization in the world is on the Data Lake bandwagon. This has addressed the rising cost of managing data and the challenge of ingesting diverse, disparate data sources, but with little focus on semantics, governance, and how the data will actually be used.
Data Lake Dimensions & Organizational Maturity
The very principle of a Data Lake grants freedom from rigid modelling and hierarchical schemas, but that doesn't mean we should store anything without rules. When governance is not enforced, its very absence turns the Lake into a landfill.
One of the fundamental asks from a Data Lake has thus dimmed: to cross-reference data sets from multiple sources and LOBs, analyze them, and find answers to real-world problems. Cataloguing is essential to keep the adoption quotient high and to increase the efficiency of the Data Lake.
Trust and quality are the basis for establishing Data Lakes. If data goes through a screening stage and relevant attributes are captured, it becomes easier to understand and to correlate with other data sets for the business question at hand.
Considering the size, variety, and scale of a Data Lake, catalogue attributes need to be identified and captured dynamically. Certain dimensions may apply to all data sets, while others are applicable only to a select few.
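One way to model this split between universal and dataset-specific dimensions is a record with a fixed core plus a free-form attribute map. This is a minimal sketch in Python; the `CatalogueEntry` fields and the example values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CatalogueEntry:
    """A catalogue record: common dimensions plus dataset-specific attributes."""
    name: str
    source: str          # originating system or LOB
    owner: str
    classification: str  # e.g. "public", "internal", "sensitive"
    # Dimensions that apply only to select data sets go in a free-form map,
    # so new attributes can be captured dynamically without schema changes.
    extra: dict[str, Any] = field(default_factory=dict)

entry = CatalogueEntry(
    name="claims_2023",
    source="policy_admin",
    owner="claims_team",
    classification="sensitive",
    extra={"retention_days": 365, "pii_columns": ["ssn", "dob"]},
)
```

The `extra` map is the dynamic part: an RDBMS extract might carry schema details there, while a text feed might carry language or encoding hints, without either forcing a change to the core record.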
A lineage chart linking data (by version) with the business model helps track which data is being used, who is using it, and where it is used. This uniformity and consistency of the data received, with actual business usage recorded, makes the content authentic and reliable.
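The "which data, who, and where" question maps naturally to a lookup keyed by data set and version. The sketch below is a deliberately simple in-memory illustration; the function name, dataset names, and business models are assumptions for the example.

```python
from collections import defaultdict

# Map (dataset, version) -> list of recorded business usages.
lineage: defaultdict = defaultdict(list)

def record_usage(dataset: str, version: str, user: str, business_model: str) -> None:
    """Record who used which version of a data set, and where it was used."""
    lineage[(dataset, version)].append({"user": user, "used_in": business_model})

record_usage("claims_2023", "v2", "analyst_a", "fraud_scoring")
record_usage("claims_2023", "v2", "analyst_b", "reserve_forecast")

# Answering "which data is being used, by whom, and where":
usages = lineage[("claims_2023", "v2")]
```

In practice this would live in a metadata store rather than a dict, but the keying by (data set, version) is the part that makes usage traceable back to a specific cut of the data.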
Data Catalogue Attributes
Security is the most important attribute of data in today's digital world. A Data Lake should be capable of identifying and managing security classifications. Each sensitive record available in the Lake should know by itself who can access it!
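A record "knowing who can access it" can be approximated by attaching a classification label to the record and checking it against a role-based policy at read time. This is a minimal sketch; the roles, labels, and `ACCESS_POLICY` mapping are hypothetical, not a standard.

```python
# Role -> classification levels that role may read (illustrative policy).
ACCESS_POLICY = {
    "analyst": {"public", "internal"},
    "compliance_officer": {"public", "internal", "sensitive"},
}

def can_access(role: str, record_classification: str) -> bool:
    """The record's own classification label decides who may read it."""
    return record_classification in ACCESS_POLICY.get(role, set())
```

Because the decision hinges on the label carried by the record rather than on where the record is stored, the same check works wherever the record travels inside the Lake.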
Another attribute could be tagging analyst reviews. The aim is to capture the reviews and feedback of other analysts who have worked on the same data set in the past. This is an example of capturing data profiling dynamically, and it requires a cultural change: bringing dynamicity not only to data creation but also to the thought process. These aspects are also important for keeping data discoverable, since users come from a traditional EDW background and tend to use only what is already known to them. That defeats the basic purpose of correlation; the objective is to create trust and to define automated ways of managing such a large entity.
When should catalogue attributes be generated? There is no ground rule as to whether to do it pre- or post-entry into the Lake. It depends on the data set being ingested and on the audience. A typical business end user will rely on metadata to understand the data set; this may not be the case for a technical developer, who has the capability to relate different data sets programmatically.
A similar consideration applies to the type of data. The need to set up a schema up front at ingestion time is different for RDBMS data than for a free-flowing text feed.
To manage such situations, it is suggested to create an extra layer over the staging data to store curated data. While ingesting, mark the relevant data sets for tagging and capture the respective attributes. This demands continuous refinement of the metadata associated with the data. We have to accept that data ingested 'as is' is not ready for business use, even though it has surely shortened the elapsed time for ingesting diverse data sets.
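The staging-plus-curated arrangement can be sketched as two layers, with a flag set at ingestion marking which data sets still need tagging before promotion. The function names, flag, and sample data sets below are assumptions for illustration only.

```python
staging: list = []   # data landed 'as is'
curated: list = []   # data promoted after attributes are captured

def ingest(dataset: dict, needs_curation: bool) -> None:
    """Land a data set in staging; mark whether it still needs tagging."""
    dataset["curation_pending"] = needs_curation
    staging.append(dataset)

def curate(dataset: dict, attributes: dict) -> None:
    """Attach captured attributes and promote the set to the curated layer."""
    dataset.update(attributes, curation_pending=False)
    curated.append(dataset)

# A free-text feed needs schema/tagging work; an RDBMS extract may not.
ingest({"name": "tweets_feed"}, needs_curation=True)
ingest({"name": "gl_extract"}, needs_curation=False)
curate(staging[0], {"language": "en", "schema": ["ts", "text"]})
```

The point of the flag is that ingestion stays fast for everything, while the curated layer only ever holds data whose metadata has been refined enough for business use.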