F2: Data are described with rich metadata

"intrinsic" vs "extrinsic" metadata. Other-than-technical interoperability. Qualification vs. "Quality". Feature detection. Search-engine "rich results".

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors.

Today, we're going to be talking about the second of the FAIR principles, F2: data are described with rich metadata.

What's the purpose of this? Well, if you don't have an identifier, you still want to be able to find the data, so you need metadata. The "rich" part is rich as in plentiful, abundant, having in large amounts, full-bodied.

There are a couple of kinds of metadata that you want to pay attention to. One is so-called intrinsic metadata. This would be metadata that's machine-defined or machine-controlled, immutable things like file sizes or the output of your instrument, things that aren't really up for debate. You can decide to include them or not, but they're things that the systems that you use give to you. They're intrinsic to the work, to the data.
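As a minimal sketch of what intrinsic metadata can look like in practice, here is one way to collect machine-derived facts about a file using only Python's standard library. The function name `intrinsic_metadata` and the field names are illustrative assumptions, not part of any standard, and the demo file is a throwaway created just for the example.

```python
import hashlib
import os
import tempfile


def intrinsic_metadata(path):
    """Collect facts the system gives us about a file (nothing user-asserted)."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),  # reported by the file system
        "sha256": digest,                     # computed from the bytes themselves
    }


# Demo on a small temporary CSV file (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"x,y\n1,2\n")

demo = intrinsic_metadata(tmp.name)
os.remove(tmp.name)
```

None of these values are up for debate: anyone with the same bytes would get the same size and checksum. The only choice is whether to include them in the metadata record.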

There's also extrinsic metadata, which is more user-defined and user-controlled. These are things like context, provenance, and qualifications. We'll be getting to technical interoperability, the I in FAIR, but with metadata you also have things like legal interoperability -- like licensing -- and social interoperability -- like whether you're using domain-relevant community standards and vocabularies. More to come on this when we discuss the R in FAIR. But note that qualification is different from any absolutist measure of quality. Quality is in the eye of the beholder, of the data consumer, so you should be generous with your metadata.

You never know what someone might find relevant or not and how they will use it. This is akin to feature detection in web browsers. Feature detection asks, "Is this certain block of code supported by the browser?" If it is, the page does one thing, and if the browser doesn't support it, the page runs something else.

Likewise, machines may look for certain metadata elements and proceed based on their presence or not. For example, Google Dataset Search will index datasets depending on whether they are described with schema.org/Dataset markup, or with equivalent structures in the W3C Data Catalog Vocabulary (DCAT).
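To make the idea of feature detection on metadata concrete, here is a rough sketch of a hypothetical machine consumer that inspects a JSON-LD record and branches on what it finds. This is not how Google Dataset Search is implemented. The function names, the features checked, and the sample record are all assumptions for illustration.

```python
import json


def detect_features(jsonld_text):
    """Report which metadata elements are present in a JSON-LD record."""
    record = json.loads(jsonld_text)
    types = record.get("@type", [])
    if isinstance(types, str):  # "@type" may be a single string or a list
        types = [types]
    return {
        "is_dataset": "Dataset" in types,
        "has_license": "license" in record,
        "has_description": bool(record.get("description")),
    }


def handle(jsonld_text):
    """Proceed one way if the feature is present, another way if not."""
    features = detect_features(jsonld_text)
    if features["is_dataset"]:
        return "index as dataset"
    return "treat as ordinary page"


# Hypothetical records, as they might be embedded in two different webpages.
dataset_page = '{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example"}'
plain_page = '{"@context": "https://schema.org/", "@type": "WebPage", "name": "About us"}'

dataset_outcome = handle(dataset_page)
plain_outcome = handle(plain_page)
```

The consumer never needs to know the dataset's identifier in advance. It only needs the record to carry the elements it is looking for.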

And there are so-called "rich results" on Google Search, rich just like in the title of this principle. These are things like displaying recipes as cards rather than just text links to the webpages. These rich results also depend on such feature detection in the metadata records that are served by webpages.

So similarly, if you have machines consuming your metadata, they may look for different features. They may look for different vocabulary elements and such. And so you really want to provide rich metadata, a plurality of things that may or may not be relevant to any given data consumer. And so this helps machine agents to find your data without knowing the identifier.
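As a sketch of what a "rich" record might look like, here is a schema.org/Dataset description built in Python and serialized as JSON-LD. The property names (`name`, `description`, `license`, `keywords`, `creator`, `measurementTechnique`, `distribution`, `contentSize`, and so on) come from schema.org. Every value, including the example.org URLs and the creator name, is a placeholder assumption.

```python
import json

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    # Extrinsic, user-supplied metadata: context, provenance, qualification.
    "name": "Example thermal conductivity measurements",
    "description": "Placeholder description of how and why the data were collected.",
    "keywords": ["thermal conductivity", "example"],
    "creator": {"@type": "Person", "name": "Example Researcher"},
    "license": "https://creativecommons.org/licenses/by/4.0/",  # legal interoperability
    "measurementTechnique": "Placeholder technique name",
    # Intrinsic, machine-derived metadata about the file itself.
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data/measurements.csv",
        "encodingFormat": "text/csv",
        "contentSize": "8 B",
    },
}

# Serialize for embedding in a webpage or serving from an API.
jsonld = json.dumps(record, indent=2)
```

No single consumer is likely to use every field here. The point of being generous is that different consumers can each find the elements they happen to look for.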

That'll be it for today. I'm Donny Winston and I hope you join me again next time for Machine-Centric Science.