R1.2: Metadata and data are associated with detailed provenance

The 14th of the 15 FAIR principles, R1.2: metadata and data are associated with detailed provenance. A dive into the World Wide Web Consortium (W3C) Provenance Data Model -- what are the different parts of provenance, and what are some terms that can be used in order to manage it?

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 14th of the 15 FAIR principles, R1.2: metadata and data are associated with detailed provenance.

I'm going to keep this one simple. Not easy, but simple.

I am a strong believer in the World Wide Web Consortium (W3C) Provenance Data Model.

I will be going over the core concepts involved in provenance as defined by this specification. And this way, I hope to give you a sense of what provenance is -- what are the different parts of provenance, and what are some terms that can be used in order to manage it?

So, in the Provenance Data Model, which is also serialized as the PROV Notation and PROV Ontology, there are three overarching core components. The first is entities and activities. There are two types here: one is an entity and one is an activity.

There are four principal relations that involve entities and activities. The first is generation. So, an entity gets generated by an activity. Another relation is invalidation. An entity gets invalidated by an activity. There's also a relation of usage. An activity might use an entity in some way. And there's also communication. So, one activity might communicate with another activity through an entity, or entities.
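To make those four relations concrete, here is a minimal sketch in Python that writes them out as PROV Notation style statements. The relation names (`used`, `wasGeneratedBy`, `wasInformedBy`, `wasInvalidatedBy`) come from the PROV specification; the `ex:` identifiers and the `statement` helper are made up for illustration.

```python
# Hypothetical identifiers (ex:raw_data, ex:cleaning, ...) for illustration only.

def statement(relation, *args):
    """Format one PROV-N style statement, e.g. used(ex:a, ex:e)."""
    return f"{relation}({', '.join(args)})"

records = [
    # The two types: entities and activities.
    statement("entity", "ex:raw_data"),
    statement("entity", "ex:clean_data"),
    statement("activity", "ex:cleaning"),
    statement("activity", "ex:plotting"),
    statement("activity", "ex:purging"),
    # Usage: an activity uses an entity.
    statement("used", "ex:cleaning", "ex:raw_data"),
    # Generation: an entity is generated by an activity.
    statement("wasGeneratedBy", "ex:clean_data", "ex:cleaning"),
    # Communication: plotting uses what cleaning generated,
    # so plotting was informed by cleaning.
    statement("used", "ex:plotting", "ex:clean_data"),
    statement("wasInformedBy", "ex:plotting", "ex:cleaning"),
    # Invalidation: an entity is invalidated by an activity.
    statement("wasInvalidatedBy", "ex:raw_data", "ex:purging"),
]

print("\n".join(records))
```

Each statement puts the thing being described first and the thing that influenced it second, which is why generation reads as entity then activity while usage reads as activity then entity.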

Some other relations here are about starting an activity. So, an activity might have a trigger entity, and an activity might have a starter activity, which, in turn, through a trigger, will start the subsequent activity. And there's also the relation of the ending of an activity. So, you might have a trigger entity to end an activity, or you might have an ender activity, and once that's done, the subsequent activity is ended.

So, that was entities and activities. The second core component of provenance is derivation. So, the big relation here is about derivation. One entity can be a derivation of another entity. There are three subtypes that the Provenance Data Model talks about. One is revision. So, this entity is a revision of another entity. This could be further qualified by version numbers. One entity might be a quotation of another entity. And, finally, one entity might be the primary source -- or, a primary source -- of another entity. So, that's about derivations -- relations between two entities.
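As a small sketch of how derivation and its three subtypes might be recorded, the Python below stores each link as a (derived entity, kind, source entity) triple. The property names are the PROV Ontology terms; the `Derivation` enum, the `sources_of` helper, and the `ex:` identifiers are assumptions made up for this example.

```python
from enum import Enum

class Derivation(Enum):
    """General derivation plus the three subtypes, named by PROV-O property."""
    DERIVED = "prov:wasDerivedFrom"
    REVISION = "prov:wasRevisionOf"
    QUOTATION = "prov:wasQuotedFrom"
    PRIMARY_SOURCE = "prov:hadPrimarySource"

# Each record reads: <derived entity> <kind> <source entity>.
# All identifiers are hypothetical.
derivations = [
    ("ex:report_v2", Derivation.REVISION, "ex:report_v1"),
    ("ex:blog_excerpt", Derivation.QUOTATION, "ex:report_v2"),
    ("ex:report_v1", Derivation.PRIMARY_SOURCE, "ex:lab_notebook"),
]

def sources_of(entity, kind=None):
    """Return the entities that `entity` was directly derived from.

    If `kind` is given, only derivations of that subtype are returned.
    """
    return [
        source
        for derived, k, source in derivations
        if derived == entity and (kind is None or k is kind)
    ]

print(sources_of("ex:report_v2"))                       # direct sources, any kind
print(sources_of("ex:report_v1", Derivation.REVISION))  # no revision recorded
```

The lookup only follows one step. Chaining derivations across several steps is a separate design decision, so this sketch leaves it out.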

The final component is where we introduce the third core type of provenance. So, we went over entities and activities, and now we have agents. So, this third core component is about agents, responsibility, and influence. There are three big relations here. The first is attribution. So, an entity might be attributed to an agent that, say, created it, or was involved in its creation.

There's also the relation of association. So, one agent may be associated with an activity -- it might play a role in that activity. Also, an activity might be associated with an entity because that entity would be a plan for that activity.

And finally, there's delegation, and this is between an agent and another agent. So, one agent might act on behalf of another agent. So, there's some delegation.
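Here is a rough sketch of attribution, association, and delegation as plain Python dictionaries, along with a helper that follows delegation to find everyone who bears some responsibility for an entity. The dictionary layout, the `responsibility_chain` function, and all `ex:` identifiers are hypothetical; only the three relation concepts come from the specification.

```python
# Attribution: entity -> agent it is attributed to.
attributed_to = {"ex:dataset": "ex:alice"}

# Association: activity -> (agent playing a role, plan entity for the activity).
associated_with = {"ex:analysis": ("ex:alice", "ex:protocol")}

# Delegation: agent -> agent on whose behalf it acts.
acted_on_behalf_of = {"ex:alice": "ex:lab", "ex:lab": "ex:university"}

def responsibility_chain(entity):
    """List the attributed agent, then each agent it acts on behalf of."""
    chain = []
    agent = attributed_to.get(entity)
    # Stop at the end of the chain, or if delegation loops back on itself.
    while agent is not None and agent not in chain:
        chain.append(agent)
        agent = acted_on_behalf_of.get(agent)
    return chain

print(responsibility_chain("ex:dataset"))
agent, plan = associated_with["ex:analysis"]
print(f"{agent} carried out ex:analysis following {plan}")
```

Keeping the plan next to the agent in the association record reflects the idea that a plan describes what an agent intended to do in that activity.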

More generally, there's this general relation of influence. So, this is where everything kind of comes together, where you might have one entity, activity, or agent, and there's some influence on another entity, agent, or activity. And that influence might be one of usage, start triggering, ending triggering, generation, invalidation, communication, derivation, attribution, association, or delegation, all of the relations we went over.
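One way to see how the ten relations fit under influence is a lookup table of which core type sits on each side. The sketch below uses the PROV relation names; the `INFLUENCE_RELATIONS` table, the two helper functions, and the `ex:` identifiers are assumptions for illustration, not part of the specification.

```python
# relation name -> (type of the thing influenced, type of the influencer)
INFLUENCE_RELATIONS = {
    "used": ("activity", "entity"),
    "wasStartedBy": ("activity", "entity"),
    "wasEndedBy": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasInvalidatedBy": ("entity", "activity"),
    "wasInformedBy": ("activity", "activity"),
    "wasDerivedFrom": ("entity", "entity"),
    "wasAttributedTo": ("entity", "agent"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}

def types_match(relation, influenced_type, influencer_type):
    """Check that a relation is used between the expected core types."""
    return INFLUENCE_RELATIONS.get(relation) == (influenced_type, influencer_type)

def generalize(relation, influenced, influencer):
    """Restate any of the ten relations as the general influence relation."""
    if relation not in INFLUENCE_RELATIONS:
        raise ValueError(f"unknown relation: {relation}")
    return f"wasInfluencedBy({influenced}, {influencer})"

print(len(INFLUENCE_RELATIONS))
print(types_match("wasAttributedTo", "entity", "agent"))
print(generalize("used", "ex:cleaning", "ex:raw_data"))
```

Generalizing loses information, so it is better to record the specific relation and fall back to influence only when nothing more precise is known.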

So, again, there are three core types in provenance -- there are entities, activities, and agents. Instantaneous events are put in the context of activities. So, there is this idea of a time instant in PROV. And so you might have generation at the time instant of completion of production. Usage would be an instant at the beginning of utilization. The start of an activity is when it's deemed started, and the end of an activity would be when it's deemed ended. And finally, invalidation as a time instant would be considered at the start of the destruction, cessation, or expiry of an entity.
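As a sketch of how those time instants relate, the Python below records the five instants for one activity and runs a few ordering checks. The checks are the kind described in the companion PROV constraints document, but the exact set chosen here, the `ordering_problems` function, and the timestamps are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical instants for one activity, its input entity (usage),
# and its output entity (generation, later invalidation).
events = {
    "start": datetime(2023, 1, 10, 9, 0),
    "usage": datetime(2023, 1, 10, 9, 5),
    "generation": datetime(2023, 1, 10, 9, 50),
    "end": datetime(2023, 1, 10, 10, 0),
    "invalidation": datetime(2023, 6, 1, 0, 0),
}

def ordering_problems(ev):
    """Return a list of human-readable ordering violations (empty if none)."""
    problems = []
    if ev["start"] > ev["end"]:
        problems.append("activity ends before it starts")
    # Usage and generation are instants within the activity's lifetime.
    for name in ("usage", "generation"):
        if not ev["start"] <= ev[name] <= ev["end"]:
            problems.append(f"{name} falls outside the activity")
    # The output entity cannot be invalidated before it is generated.
    if ev["generation"] > ev["invalidation"]:
        problems.append("invalidation precedes generation")
    return problems

print(ordering_problems(events))

# Moving the usage instant before the start produces one violation.
bad = dict(events, usage=datetime(2023, 1, 10, 8, 0))
print(ordering_problems(bad))
```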

So, in summary, there are three core types: entities, activities, and agents. There are ten influencing relations. That's not including the three subtypes of derivation -- revision, quotation, and having a primary source. And, hopefully, if you consult the Provenance Data Model, you'll have this nice controlled vocabulary for how to talk about provenance. Furthermore, with the PROV Notation and the PROV Ontology, you'll have some ideas about actual serialization -- what your field names can be in order to express these three core types and ten relations. That way, you have a real sense of what provenance means and what bases you might wish to cover in order for someone to decide whether or not to reuse the thing that you've published. It all connects to how they might trust it: how they might consider it consistent with what they've done, how they might consider it valid based on methodology, and how they might consider it timely, depending on when things were done.

That'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.