Machine-Centric Science | Transcript: I2: (Meta)data use vocabularies that follow the FAIR principles

May 4, 2022 • 6 Minutes

I2: (Meta)data use vocabularies that follow the FAIR principles

The 10th of the 15 FAIR principles, I2: metadata and data use vocabularies that follow the FAIR principles.

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 10th of the 15 FAIR principles, I2: metadata and data use vocabularies that follow the FAIR principles.

Datasets use vocabularies. The data is expressed using a vocabulary, and what this principle says is that the vocabulary itself should be documented and resolvable using globally unique and persistent identifiers. That way, the vocabulary is findable, accessible, interoperable, and reusable by anyone working with data that is expressed using it.

Why is this important? Well, when your vocabulary is not only a controlled vocabulary but also FAIR, so it has these global identifiers, you can detect and prevent false agreements, where maybe two people are using the same word for different things.

So, same string, different thing. You can also detect and prevent false disagreements. So, different string, but same thing.
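As a minimal sketch of that idea in Python: the three records below, their labels, and the example.org identifiers are all hypothetical, made up only to show how comparing identifiers rather than strings surfaces both kinds of mismatch.

```python
from itertools import combinations

# Hypothetical records: labels and example.org URIs are illustrative
# assumptions, not taken from any real vocabulary.
records = [
    {"source": "lab_a", "label": "sugar",
     "term_uri": "https://example.org/vocab/sucrose"},
    {"source": "lab_b", "label": "sugar",
     "term_uri": "https://example.org/vocab/total-sugars"},
    {"source": "lab_c", "label": "saccharose",
     "term_uri": "https://example.org/vocab/sucrose"},
]


def compare_terms(a, b):
    """Classify a pair of records by label match versus identifier match."""
    same_label = a["label"].casefold() == b["label"].casefold()
    same_thing = a["term_uri"] == b["term_uri"]
    if same_label and not same_thing:
        return "false agreement"      # same string, different thing
    if same_thing and not same_label:
        return "false disagreement"   # different string, same thing
    return "consistent"


# Compare every pair of sources once.
findings = {
    (a["source"], b["source"]): compare_terms(a, b)
    for a, b in combinations(records, 2)
}

for pair, verdict in findings.items():
    print(pair, verdict)
```

Without the identifiers, lab_a and lab_b would look like they agree and lab_a and lab_c would look like they disagree; with them, both mismatches can be flagged mechanically.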

One thing I wanted to go over briefly is some different types of controlled vocabularies. I attended a talk by Heather Hedden at the Knowledge Graph Conference in New York City, and she had a nice breakdown of controlled vocabulary types. So, some common ones are: a term list, which is for ambiguity control. Then, on top of that, you might have a name authority, where you have the ambiguity control plus synonym control. You might build on top of that and have a taxonomy, where you're adding on hierarchical relationships. And then you might add to that and have a thesaurus, where you have ambiguity control, synonym control, hierarchical relationships, but also vague associative relationships, like saying that some concept is "related to" another concept.

And then at the top of expressiveness, of power, I guess, you have ontologies, where you have all of the things about term lists or name authorities or taxonomies or thesauri, but you also have precise associative relationships. You have semantic relationships, so not just "related to", but relations that themselves have semantics.

So these are all variants of controlled vocabularies. So, when this principle is talking about vocabulary, it's talking about all of these things.
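One way to picture that layering is as sets of features, each type adding to the one before it. This is only a sketch of the breakdown as described here: the constant names and feature strings are illustrative assumptions, not an official classification.

```python
# Each vocabulary type is modeled as the set of features it provides.
# Names and feature strings are illustrative, following the breakdown above.
TERM_LIST = frozenset({"ambiguity control"})
NAME_AUTHORITY = TERM_LIST | {"synonym control"}
TAXONOMY = NAME_AUTHORITY | {"hierarchical relationships"}
THESAURUS = TAXONOMY | {"vague associative relationships"}
ONTOLOGY = THESAURUS | {"precise semantic relationships"}

# Ordered from least to most expressive.
VOCABULARY_TYPES = {
    "term list": TERM_LIST,
    "name authority": NAME_AUTHORITY,
    "taxonomy": TAXONOMY,
    "thesaurus": THESAURUS,
    "ontology": ONTOLOGY,
}


def added_features(name):
    """Return what a vocabulary type adds over the previous, simpler type."""
    names = list(VOCABULARY_TYPES)
    index = names.index(name)
    previous = VOCABULARY_TYPES[names[index - 1]] if index > 0 else frozenset()
    return VOCABULARY_TYPES[name] - previous


for name in VOCABULARY_TYPES:
    print(f"{name}: adds {sorted(added_features(name))}")
```

Each type is a strict superset of the one before it, which is the sense in which an ontology sits at the top of expressiveness.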

Another talk at the Knowledge Graph Conference that I found to be useful for this principle was a talk by Teodora Petkova on the dialogic potential of the Web of Data. She spoke about the importance of dialogic orientation, meaning an orientation towards long-term relationship-building and adding to a network rather than pruning it. So, relating not replacing.

"Dialogic" is in the sense of dialogue as expounded by physicist David Bohm, so Bohm dialogue, where dialogue is a free flow of meaning between participants with various points of view, with thought treated as a collective process.

And it's a collective process because the data pieces might be meaningful to different audiences: to human audiences and to algorithmic, or machine, audiences. And so there's this idea of content as a portmanteau of concept and tent. It's a weave of concepts.

One nice explanation of this was given by Tim Berners-Lee, talking about a bag of chips. So, on a bag of chips, you might have the seller content, the weave of concepts from the seller, on the front. So, marketing: these chips are great. Tasty. All natural. Then you'll have the US Food and Drug Administration mandated content with that controlled vocabulary on the back. And then you would also have a Universal Product Code barcode on the bottom. So, that is controlled and that's not for humans, but it's for machines that are part of the supply chain. So, with this bag of chips, you have this plurality of content, where "sugar" on the FDA label means something very specific that might be unrelated to the use of the term "sugar" in the marketing on the front.

You can prevent these false agreements and false disagreements when you're able to use and point to globally unique and persistent identifiers in order to express those vocabularies, those contexts for content.

One nice example of a controlled vocabulary that follows this principle is the Data Catalog Vocabulary (DCAT), a W3C Recommendation. For example, a DCAT Dataset has 29 properties, all with URIs: 22 it inherits from Cataloged Resource, and seven are specific to Dataset.

So these are very specific properties that a DCAT Dataset can have, and of course you can extend that, but these all have URIs. They're all FAIR. The DCAT controlled vocabulary is FAIR in this sense.

And a DCAT Dataset is not exactly the same thing as a Schema.org Dataset, although Google Dataset Search will index both. So, it's nice to be able to make clear which notion of a dataset you mean by pointing to a URL for it. I would recommend you just use DCAT instead of inventing your own, but you certainly can invent something that you think is more appropriate for your context, as Schema.org has done.
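Here is a small sketch of what pointing to a URL for it can look like, written as JSON-LD-style Python dictionaries. The namespace URIs for DCAT, Dublin Core Terms, and Schema.org are the published ones; the dataset identifiers under example.org, the titles, and the helper function name are hypothetical.

```python
import json

# Published vocabulary namespaces.
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
SCHEMA = "https://schema.org/"

# Hypothetical dataset descriptions; the example.org identifiers and the
# titles are made up for illustration.
dcat_record = {
    "@id": "https://example.org/dataset/42",
    "@type": DCAT + "Dataset",
    DCTERMS + "title": "Example measurements",
    DCAT + "keyword": ["example"],
}

schema_record = {
    "@id": "https://example.org/dataset/42-mirror",
    "@type": SCHEMA + "Dataset",
    SCHEMA + "name": "Example measurements",
}


def same_notion_of_dataset(a, b):
    """True only if both records point to the same class identifier."""
    return a["@type"] == b["@type"]


print(json.dumps(dcat_record, indent=2))
print(same_notion_of_dataset(dcat_record, schema_record))
```

Both records would be called a "dataset" in conversation, but the type identifiers differ, so a machine reading them does not have to guess which definition applies.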

That'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.