I3: (meta)data include qualified references to other (meta)data

It's more powerful when our references are indexed by nature rather than by number. On the 11th of the 15 FAIR principles, I3: metadata and data include qualified references to other metadata and data.

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR Principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 11th of the 15 FAIR principles, I3: metadata and data include qualified references to other metadata and data.

A qualified reference is a cross-reference that explains its intent. The nature of the relationship is clear.

These links may use domain-specific terminology. You may say that X "is a regulator of" Y, and that's much more of a qualified reference than X "is associated with" Y, or X "see also" Y.

A qualified reference can also use domain-agnostic terms that are nonetheless narrower in meaning than something like "is associated with".

For example, does a dataset "build on" another dataset? Does it "depend on" another dataset for completeness? Or is it "complemented by" another dataset? The goal with qualified references is to enrich contextual knowledge, balanced, of course, against the time and energy involved in making a good metadata model.

Another example of a qualified reference is versioning. You might say that one dataset is "a version of" another. Something a bit more qualified would say "this dataset is a prior version of this other dataset" or "this dataset is the next version of this other dataset".
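To make the versioning example concrete, here is a minimal Python sketch. It assumes references are stored as (subject, relation, object) triples. The relation labels are the phrases used above, and the dataset names are made-up placeholders, not real identifiers. Because "prior version of" and "next version of" are inverses, stating one lets a machine derive the other.

```python
# Hypothetical relation labels taken from the phrases above; each maps to its inverse.
INVERSE = {
    "is prior version of": "is next version of",
    "is next version of": "is prior version of",
}


def with_inverses(triples):
    """Return the given triples plus the inverse of every versioning triple."""
    result = set(triples)
    for subject, relation, obj in triples:
        if relation in INVERSE:
            result.add((obj, INVERSE[relation], subject))
    return result


# Placeholder dataset names, for illustration only.
stated = {("dataset-v1", "is prior version of", "dataset-v2")}
derived = with_inverses(stated)
```

An unqualified "is a version of" link would not support this: there is no way to tell from it which dataset came first.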

A good example of progressive narrowness in qualified references comes from the W3C Provenance ontology. It defines a variety of relationships between entities such as publications and datasets. At a broad level, you have that one resource "was influenced by" another resource. Narrower than that might be "was derived from". Even narrower would be choosing among "had primary source", "was quoted from", or "was revision of".
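The narrowing described here can be sketched as a small lookup table in Python. The property names follow the W3C PROV-O terms mentioned above, written with a `prov:` prefix. The helper functions `broadenings` and `holds` are hypothetical names introduced only for this illustration, and a real system would more likely load the ontology itself than hard-code the hierarchy.

```python
# Each narrower PROV-O property mapped to its immediate broader property.
BROADER = {
    "prov:wasDerivedFrom": "prov:wasInfluencedBy",
    "prov:hadPrimarySource": "prov:wasDerivedFrom",
    "prov:wasQuotedFrom": "prov:wasDerivedFrom",
    "prov:wasRevisionOf": "prov:wasDerivedFrom",
}


def broadenings(relation):
    """List the relation followed by every broader relation it implies."""
    chain = [relation]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def holds(asserted, queried):
    """True if asserting `asserted` also answers a query for `queried`."""
    return queried in broadenings(asserted)
```

The point of the sketch is that a narrow statement can always answer a broad question, but not the other way around: knowing only that something "was influenced by" something else does not tell you it was a revision.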

The references in a scientific article are typically unqualified. They're typically indexed by number. You might have 20 references, and it's just a soup of numbers. There might be some correlation between a low number and that reference being prior art, or something canonical or seminal in the field or the subfield, just because low-numbered references tend to be cited in the intro or motivation section.

But really there's nothing to tell you that. You don't know which references in an article are saying, this is a seminal work in this field, or, this is related work, versus something like, we used the dataset from this paper, or, this study builds on this study, or, this study contradicts this other study. There's nothing in the unqualified numeric indexing of typical references that lets you know about this. So, again, literature references are often just a list rather than something more powerful like key-value pairs. It's more powerful when our references are indexed by nature rather than by number.
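As a rough illustration of list versus key-value pairs, the Python sketch below shows the same references both ways. The relation labels are the informal phrases from this episode, not terms from any particular citation vocabulary, and the paper names are placeholders.

```python
from collections import defaultdict

# Indexed by number: position is the only thing we know about each reference.
numbered = ["paper-a", "paper-b", "paper-c", "paper-d"]

# Indexed by nature: each reference carries a (hypothetical) relation label.
qualified = [
    ("builds on", "paper-a"),
    ("uses dataset from", "paper-b"),
    ("contradicts", "paper-c"),
    ("builds on", "paper-d"),
]


def by_nature(references):
    """Group reference targets under the relation that qualifies them."""
    index = defaultdict(list)
    for relation, target in references:
        index[relation].append(target)
    return dict(index)


index = by_nature(qualified)
```

With the qualified form, a question like "which studies does this one contradict?" is a dictionary lookup. With the numbered list, the same question can only be answered by reading the article.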

Another example is with webpages. You typically have unqualified links, or rather, they're vaguely, ambiguously qualified. So with webpages, there might be some hint in the text that the webpage author placed between the opening and closing anchor tags for a given link. This text is often colored blue and/or underlined by default. It's usually highlighted.

But is that text labeling the end target of the link? Is it labeling the link itself, the relationship between the current page and the target? Who knows. This is not qualified.

If you look at a table of data, each row has a bunch of links to values, you might say. And they're qualified by the column identity. Imagine if you had a table with five columns and you just put your hand over the column labels at the top of the table. Then you'd just know that there are five values per row, and they're all associated with the row ID, but they're not really qualified. It's useful to have qualified cell values.
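The hand-over-the-labels picture can be sketched in Python like this. The column names and values are invented for illustration, and the `qualify` helper is a hypothetical name: it simply pairs each cell with its column label.

```python
# Made-up column labels and rows, for illustration only.
header = ["sample_id", "temperature_c", "ph"]
rows = [
    ["s1", 21.5, 7.0],
    ["s2", 37.0, 6.4],
]


def qualify(header, rows):
    """Pair every cell value with the column label that qualifies it."""
    return [dict(zip(header, row)) for row in rows]


records = qualify(header, rows)
```

Without the header, `rows[0][2]` is just the third value associated with a row. With it, `records[0]["ph"]` says what that value means.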

That'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.