database – What are the differences between Data Lineage and Data Provenance? – Code Utility

From Wikipedia:

Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.

Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins.

It seems that both concepts are about where the data comes from, but I’m still confused about the differences. Are the two concepts the same? If they are different, can someone share an example?

Thanks,

,

From our experience, data provenance includes only a high-level view of the system for business users, so they can roughly navigate where their data comes from. It’s provided by a variety of modeling tools, or just by simple custom tables and charts. Data lineage is a more specific term and includes two sides: business (data) lineage and technical (data) lineage. Business lineage pictures data flows on a business-term level, and it’s provided by solutions like Collibra, Alation and many others. Technical data lineage is created from actual technical metadata and tracks data flows on the lowest level: actual tables, scripts and statements. Technical data lineage is provided by solutions such as MANTA or Informatica Metadata Manager.
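To make the "lowest level" idea concrete, here is a minimal sketch of table-level technical lineage as a list of edges, with a helper that walks upstream from a target table. The table names, the statements and the `upstream_tables` helper are all made up for illustration; this is not how MANTA or Informatica store or query lineage.

```python
# Hypothetical technical lineage: one edge per statement that moves data
# from a source table into a target table.
EDGES = [
    ("crm.customers", "staging.customers", "INSERT INTO staging.customers SELECT * FROM crm.customers"),
    ("erp.orders", "staging.orders", "INSERT INTO staging.orders SELECT * FROM erp.orders"),
    ("staging.customers", "mart.customer_orders", "build_customer_orders.sql"),
    ("staging.orders", "mart.customer_orders", "build_customer_orders.sql"),
]


def upstream_tables(target, edges):
    """Return every table that directly or indirectly feeds `target`."""
    # Index the edges by target table so each lookup is a dict access.
    parents = {}
    for source, dest, _statement in edges:
        parents.setdefault(dest, set()).add(source)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for table in sorted(upstream_tables("mart.customer_orders", EDGES)):
        print(table)
```

A business-lineage view would collapse the same graph to business terms (for example "Customer" feeding "Customer Orders") and hide the staging tables and scripts.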

,

Data Provenance is,

data lineage (the genealogy and history of the data's journey: where it began, how it came into being, how it changed over time, where it has been, which systems it has traveled through, any loss or gain), i.e. data oriented, metadata

PLUS

the inputs, entities, systems and processes that influenced the data (i.e. process oriented) which can be used to reproduce the data.

,

See this section in the Wikipedia article on provenance: https://en.wikipedia.org/wiki/Provenance#Science. It links to collections of academic and industry work on provenance.

To succinctly answer your question: in general, there’s not enough context known to differentiate between data lineage and data provenance. Within a specific context, you could look for, or create, specific and possibly different, definitions.

,

Data Provenance is the point of origin for the data term; Data Lineage is the complete data transformation journey from the point of origin to the current observation point in the system.

,

I believe a simpler explanation is: who owns it, who touched it, and where is it going.

In a Business sense, that can be summed up in Data Flow Diagrams.

In a Technical sense, that’s a whole lot of baggage to start adding onto data as it flows from system to system. There has to be some HUGE justification to carry that mountain around and for what purpose? To see some pretty graphs? Not going to happen in large real world environments. The justification in $$$ for what??

It’s one thing to tag data with a simple 2 to 4 byte code of origin as it moves from system to system, but to keep all of that other technical jumbo, at the cost of system performance degradation, DASD, backups, etc., for a pretty graph? No way.
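For what it's worth, the cheap end of that trade-off looks roughly like this. It is only a sketch: the `ORIGIN_CODES` registry and the `tag` / `untag` helpers are hypothetical, and it assumes records are plain byte strings with a 2-byte origin code prefixed to each one.

```python
import struct

# Hypothetical registry mapping source systems to small integer codes.
ORIGIN_CODES = {"CRM": 1, "ERP": 2}
CODE_TO_ORIGIN = {code: name for name, code in ORIGIN_CODES.items()}

# 2-byte big-endian unsigned integer header.
_HEADER = struct.Struct(">H")


def tag(payload: bytes, origin: str) -> bytes:
    """Prefix the payload with the 2-byte code of its system of origin."""
    return _HEADER.pack(ORIGIN_CODES[origin]) + payload


def untag(record: bytes):
    """Split a tagged record back into (origin name, payload)."""
    (code,) = _HEADER.unpack_from(record)
    return CODE_TO_ORIGIN[code], record[_HEADER.size:]


if __name__ == "__main__":
    record = tag(b"hello", "ERP")
    print(len(record) - len(b"hello"), "bytes of overhead")
    print(untag(record))
```

The overhead is a fixed two bytes per record, which is the point being made: the origin travels with the data, but none of the intermediate transformation history does.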

,

Data Lineage Vs. Data Provenance: Goals
The key goal of a data lineage tool is data lifecycle management right from the data origination to the data exhaustion.

On the other hand, the key goal of data provenance is specifically to track data origination and to segregate data into three key stages: data in motion, data in process, and data at rest.

Data Lineage Vs. Data Provenance: Components
The key components of data lineage include a web portal, data capture sources, and data nurture methods. These components also include data qualification systems, CRM systems, and an ERP system.

On the other hand, the key components of data provenance include all the data lineage components and some more: tracking of the capture sources and of the data input methods.

Data Lineage Vs. Data Provenance: Challenges
Key challenges of data lineage include managing large volumes of data. They also include maintaining data lineage, tracking across channels, and unifying disparate promotional systems.

The key challenges of data provenance include the data lineage challenges and a few more: large and complicated workflows, and reproducing the execution for data retention.


,

Definition

Let me highlight what I believe to be the critical part of the definition of data provenance that is not found in the definition of data lineage:

providing a historical record of the data and its origins

Though the wording is different, I believe this addition is the only relevant difference in how provenance and lineage are defined.

Interpretation

The interpretation that I follow, and that I have seen used often in a Big Data context, is that lineage shows you which path the data has taken, but provenance allows you to know what the data looked like along the way.

Example

If you have a workflow that does this:

Gather input from source a, b > combine to c > update in ‘random’ fashion and store in d

Then I would say lineage allows you to know that the data went from a and b through c into d. Deep lineage would even allow you to see the logic used for this. However, this may not let you know what c looked like. In the theoretical random example this is hopefully clear; in practice, situations are less random, but many are irreproducible to the point where they might as well have been random.

Now provenance would keep track of the path taken, and on top of this also what the data looked like in c.
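Here is a toy sketch of that workflow, under my interpretation above. The `run_workflow` function and the record layouts are invented for illustration and do not reflect how any particular tool stores this: lineage keeps only the path, while provenance keeps the path plus a snapshot of the data at each step.

```python
import random


def run_workflow(a, b, rng):
    """Combine a and b into c, randomly update c into d, and record both views."""
    lineage = []      # path only: (inputs, step, output)
    provenance = []   # path plus what the data looked like at that point

    def record(step, inputs, output, value):
        lineage.append((inputs, step, output))
        provenance.append({
            "inputs": inputs,
            "step": step,
            "output": output,
            "snapshot": list(value),  # copy, so later changes do not alter history
        })

    c = a + b
    record("combine", ("a", "b"), "c", c)

    # The 'random' update: d cannot be recomputed from the path alone.
    d = [x + rng.random() for x in c]
    record("random_update", ("c",), "d", d)

    return d, lineage, provenance


if __name__ == "__main__":
    d, lineage, provenance = run_workflow([1, 2], [3], random.Random())
    print(lineage)                     # the path: a, b -> c -> d
    print(provenance[0]["snapshot"])   # what c looked like
    print(provenance[1]["snapshot"])   # what d looked like on this run
```

Running it twice gives the same lineage but different provenance snapshots for d, which is exactly the information lineage alone cannot give back.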

Note on implementation

As others mentioned, tracking and storing provenance can be a heavy burden, but it can be great to assist in development, especially of streaming data flows (it is like having a debug point everywhere). Furthermore there may be cases where the provenance is so important (or the data volume and number of transformations comparatively low) that one may want to keep the provenance for a certain period of time.

In practice provenance is not kept as long as lineage, but some tools like NiFi do capture it out of the box, keep it for a short while where it is most valuable, and in parallel track the normal lineage.


Full disclosure and disclaimer:
Though I am an employee of Cloudera, a company much involved with Governance, Lineage and products like NiFi, the description above is based on my personal experience, and from talking to colleagues and customers about Lineage and Provenance.
