In the world of data analytics in 2019, keeping tabs on where bits of information came from, how they were processed and where they ended up is more important than ever. This concept boils down to two words: data lineage. Just as a dog breeder would want to know the lineage of a pooch they’re paying for, folks in the business intelligence sector want to know the lineage of the data that shows up in a final work product. Let’s look at the what, the why and the how of this process.
What is Data Lineage?
The simplest form of lineage for data is indexing items with unique keys that follow them everywhere. From the moment a piece of data is entered into a system, it should be tagged with a unique identifier that will follow it through every process it’s subjected to. This will ensure that all data points can be tracked across departments, systems and even data centers.
The concept can be extended significantly. Metadata about entries can include information regarding:
- Original publication dates
- Names of authors
- Copyright attributions
- The date of the original entry
- Any subsequent dates when it was accessed or modified
- Parties that accessed or modified the data
- Analytics methods that were used to process the data
In other words, the lineage functions as a pedigree that allows anyone looking at it to evaluate where it came from and how it got where it is today.
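To make the idea concrete, here is a minimal sketch of a record that carries a unique identifier and a running history with it. The class and field names (`TrackedRecord`, `LineageEvent`, `log`) are hypothetical, chosen only for illustration; a real system would likely persist this information rather than hold it in memory.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    """Current time in UTC, used as the default timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class LineageEvent:
    """One access or modification of a record (hypothetical structure)."""
    actor: str        # party that accessed or modified the data
    action: str       # e.g. "accessed", "modified", "analyzed"
    method: str = ""  # analytics method applied, if any
    at: datetime = field(default_factory=_now)


@dataclass
class TrackedRecord:
    """A data point plus the metadata that travels with it."""
    payload: dict
    author: str
    # Unique key assigned the moment the data enters the system.
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    entered_at: datetime = field(default_factory=_now)
    history: list = field(default_factory=list)

    def log(self, actor, action, method=""):
        """Append an event to the record's history and return the record."""
        self.history.append(LineageEvent(actor, action, method))
        return self


# Illustrative usage with made-up values.
record = TrackedRecord(payload={"claim_amount": 1200}, author="intake-form")
record.log("analyst_a", "accessed")
record.log("pricing-model", "analyzed", method="regression")
```

Anyone holding `record` can read `record.history` to see who touched the data and which method was applied, which is the pedigree described above in miniature.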
Why Does This Matter?
Within the context of business intelligence, there will always be questions about the inputs that went into a final product. Individual data points can be reviewed to discover problems with processes or to show how transformations occurred. This allows folks to:
- Perform quality control on both the data and analytics techniques
- Explain how particular insights were arrived at
- Consider alternative approaches
- Refine techniques
- Mine older sources of data using new technologies
When someone wants to pull a specific anecdote from the data, the lineage allows them to get very granular, too. In the NBA of 2019, for example, shot location data is used to study players, set defenses and even choose when and where to shoot. If a coach wants to cite an example, they can look through the lineage for a shot in order to find film to pull up.
The same logic applies in many business use cases. An insurance company may be trying to find ways to deal with specific kinds of claims. No amount of data in the world is going to have the narrative power of a particular anecdote. In presenting insights, data scientists can enhance their presentations by homing in on a handful of data points that really highlight the ideas they’re trying to convey. This might include:
- Providing quotes from adjusters’ reports
- Comparing specifics of an incident to more generalized insights
- Showing how the numbers align
- Talking about what still needs to be studied
Data governance is also becoming a bigger deal with each passing year. Questions about privacy and anonymization can be answered based on the lineage of a company’s data. Knowing the entire life cycle of a piece of information ultimately enhances trust both within an organization and with the larger public.
Cost savings may be discovered along the way, too. Verification can be sped up by having a good lineage already available. Errors like duplication are more likely to be found, and found sooner, ultimately improving both the quality and speed of a process. If a data set is outdated, that will be more evident from its lineage.
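As a rough illustration of how lineage metadata can surface duplicates and stale data, the sketch below fingerprints each record’s content and checks its last-modified date. The function names, the record layout (`record_id`, `payload`) and the one-year threshold are all assumptions made for the example, not part of any particular product.

```python
import hashlib
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def content_fingerprint(payload):
    """Hash a payload in a key-order-independent way."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Group record IDs whose payloads are identical.

    `records` is a list of dicts with "record_id" and "payload" keys
    (a hypothetical layout for this example).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[content_fingerprint(rec["payload"])].append(rec["record_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def is_outdated(last_modified, max_age, now=None):
    """Return True if the record has not been modified within max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > max_age


# Illustrative data: records "a" and "c" carry the same content.
records = [
    {"record_id": "a", "payload": {"policy": 17, "amount": 1200}},
    {"record_id": "b", "payload": {"policy": 18, "amount": 300}},
    {"record_id": "c", "payload": {"amount": 1200, "policy": 17}},
]
duplicates = find_duplicates(records)

stale = is_outdated(
    last_modified=datetime(2018, 1, 1, tzinfo=timezone.utc),
    max_age=timedelta(days=365),
    now=datetime(2019, 6, 1, tzinfo=timezone.utc),
)
```

Because every record keeps its own identifier, the duplicate groups point back to the exact entries involved, so each one can be traced to its source before anything is merged or deleted.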
Talking about data lineage in the abstract is one thing. Implementing sensible and practical policies is another.
Just as data analytics demands a number of particular cultural changes within an organization, caring about lineage takes that one step further. It entails being able to:
- Document where all the company’s data came from
- Account for who has used it and how
- Explain why certain use cases were explored
- Vouch for the governance of the data with a high level of confidence
At a technical level, databases have to be configured to make tracking lineage possible. Data architecture takes on new meaning under these circumstances, and systems have to be designed from the start with lineage in mind. This can often be a major undertaking when confronting banks of older data. If it’s implemented in the acquisition and use of new data, though, it can save a ton of headaches.
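One possible way to configure a database for lineage is to keep a separate event table keyed to each record’s identifier. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database; the table names, column names and helper functions are hypothetical and meant only to show the shape of such a design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE records (
    record_id  TEXT PRIMARY KEY,
    source     TEXT NOT NULL,
    entered_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE lineage_events (
    event_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    record_id   TEXT NOT NULL REFERENCES records(record_id),
    actor       TEXT NOT NULL,
    action      TEXT NOT NULL,
    occurred_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")


def add_record(conn, record_id, source):
    """Register a new record and where it came from."""
    conn.execute(
        "INSERT INTO records (record_id, source) VALUES (?, ?)",
        (record_id, source),
    )


def log_event(conn, record_id, actor, action):
    """Note who did what to a record."""
    conn.execute(
        "INSERT INTO lineage_events (record_id, actor, action) VALUES (?, ?, ?)",
        (record_id, actor, action),
    )


def trace(conn, record_id):
    """Return the record's events in the order they were logged."""
    rows = conn.execute(
        "SELECT actor, action FROM lineage_events "
        "WHERE record_id = ? ORDER BY event_id",
        (record_id,),
    )
    return rows.fetchall()


# Illustrative usage with made-up names.
add_record(conn, "rec-001", "third-party-feed")
log_event(conn, "rec-001", "etl-job", "loaded")
log_event(conn, "rec-001", "analyst_a", "queried")
history = trace(conn, "rec-001")
```

Keeping events in their own table means the lineage grows without altering the original entry, and the foreign key prevents events from being logged against a record that was never registered.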
Tracking the lineage of a company’s data allows it to handle a wide array of tasks more professionally and competently. This is especially the case when pulling data from outside sources, particularly when paying for third-party data. Not only is caring about lineage the right thing to do, but it also has a strong business case to back it up.