I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 9th of the 15 FAIR principles, I1: metadata and data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
Wow, that is a mouthful. Let's try to break this down.
A language for knowledge representation. So, what does this mean?
Each computer system, as part of a network of data exchange, should at least know the other systems' data exchange formats.
What does it mean to know the formats? Well, first, there should be controlled vocabularies, or ontologies, or thesauri -- the actual tokens of the language, the words themselves, along with the grammar of the language for combining them into sentences that make sense.
And ideally, all of these words, and the grammatical constructs of the language, will have globally unique, persistent, resolvable identifiers, GUPRIs. This hearkens back to the F principles.
So, that's one element of a language for knowledge representation, but where does the knowledge come in? Well, you don't just have the language. You have to know how to interpret it. You have to know how to get meaning out of the words and the combinations of terms in the vocabularies and the ontologies.
So the second component, apart from having these controlled term sets, is to have models: metadata and data models, well-defined frameworks to describe and structure the metadata. Just as a picture frame gives you somewhere to hang a picture, a framework should have little hooks on which to hang your different controlled vocabulary terms.
Let me get into some examples of models and formats to be a little more concrete here. In terms of models, one really nice one that I like a lot is the Resource Description Framework, RDF. It's a way of modeling arrangements of terms as statements of "subject, predicate, object", so-called triples. You can also group triples into a named graph, a sort of source, roughly equivalent to the file that a line of code lives in. And there are models that build on top of RDF, like RDF Schema (RDFS), the Web Ontology Language (OWL), and the Shapes Constraint Language (SHACL). These all fall into that RDF world, and each of them includes its own set of controlled terms: what does each concept in the Web Ontology Language, or in the Shapes Constraint Language, mean? One of the nice things about these models is that they all have globally unique, persistent IDs for all of their terms.
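To make triples concrete, here's a minimal sketch in Python using the rdflib library. The namespace and the terms like Sample and temperature are hypothetical stand-ins for whatever vocabulary your domain actually uses.

```python
# A minimal sketch of RDF triples with rdflib; the example
# namespace and terms below are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/lab/")  # hypothetical vocabulary

g = Graph()
sample = EX["sample-42"]

# Each statement is a (subject, predicate, object) triple.
g.add((sample, RDF.type, EX.Sample))
g.add((sample, RDFS.label, Literal("Sample 42")))
g.add((sample, EX.temperature, Literal(21.5)))

# The same model can be serialized in any RDF format, e.g. Turtle.
print(g.serialize(format="turtle"))
```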
Formats for RDF models include JSON-LD, which is JSON, the JavaScript Object Notation format, but adhering to RDF modeling. You could also have a JSON format adhering to JSON Schema modeling.
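Here's a minimal sketch of a JSON-LD document, written as a Python dict. The schema.org context is a real, broadly shared vocabulary; the @id is a hypothetical identifier.

```python
# A minimal sketch of JSON-LD: plain JSON whose "@context" maps
# the keys to globally unique IRIs, making it an RDF model.
import json

doc = {
    "@context": "https://schema.org/",           # real shared vocabulary
    "@id": "https://example.org/people/alice",   # hypothetical identifier
    "@type": "Person",
    "name": "Alice",
    "affiliation": {"@type": "Organization", "name": "Example Lab"},
}

# Remove "@context" and this is just JSON: same format, no model.
print(json.dumps(doc, indent=2))
```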
So, JSON Schema is different from, say, RDF with SHACL. The important thing here is that JSON itself is not a model. You can't answer "How is your data modeled?" with "Oh, it's JSON." JSON is a format. You need a model, whether a JSON Schema or an RDF Schema or whatever, to be expressed through that format.
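And here's the same distinction from the JSON Schema side: a minimal sketch, assuming the jsonschema package is installed, with hypothetical field names. The schema is the model; the JSON is just the format it's expressed in.

```python
# A minimal sketch of a model (JSON Schema) expressed in the JSON format.
from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "temperature": {"type": "number"},
    },
    "required": ["name", "temperature"],
}

data = {"name": "Sample 42", "temperature": 21.5}

# Raises jsonschema.ValidationError if the data doesn't fit the model.
validate(instance=data, schema=schema)
print("valid against the model")
```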
So, another example of this: the ActivityPub protocol. People are familiar with microblogging, like Twitter. There's also a nice "fediverse", a federated universe of microblogging and other things.
Mastodon is an example of this, and federated protocols are a great example of this kind of use. The ActivityPub protocol that underlies the messaging system of Mastodon and other ActivityPub-compliant software uses JSON as the format, and specifically a JSON-LD representation of a modeling language: the Activity Streams vocabulary and model, which covers actors, the signing of messages, the objects of messages, and different activities like creating messages and sending them.
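To make that concrete, here's a minimal sketch of the kind of JSON-LD message an ActivityPub server exchanges: a "Create" activity wrapping a note. The @context is the real Activity Streams namespace; the actor and object URLs are hypothetical.

```python
# A minimal sketch of an Activity Streams "Create" activity, the sort
# of JSON-LD payload ActivityPub servers send to one another.
import json

activity = {
    "@context": "https://www.w3.org/ns/activitystreams",  # real namespace
    "type": "Create",
    "id": "https://social.example/activities/1",      # hypothetical URLs
    "actor": "https://social.example/users/alice",
    "object": {
        "type": "Note",
        "id": "https://social.example/notes/1",
        "content": "Hello, fediverse!",
    },
}

print(json.dumps(activity, indent=2))
```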
So, that's an example of that in action, and really, any kind of federated system works like ActivityPub. Email is one: different servers host different users, and you can send a message to someone on another server just by using their email address. Phone networks work this way. Physical mail works this way: if you have a US zip code and an address, you can express your address in a formal format, and the US Postal Service will use image recognition against the format it expects and get your letter routed to the right place.
The XMPP protocol, the Extensible Messaging and Presence Protocol, uses this approach for instant messaging. WhatsApp built its messaging on a modified XMPP, Zoom and Jitsi use it for chat, and Google's cloud messaging has offered an XMPP interface for push notifications. So XMPP is another example of a federated protocol, like ActivityPub, like email with SMTP. These are all situations where you want parties who don't agree a priori, and who don't share a centralized system, to be able to exchange data.
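For comparison, here's a minimal sketch of an XMPP message stanza. It's XML rather than JSON, but it's the same idea: a shared, formal format that lets servers that have never met route a message. The addresses are hypothetical.

```python
# A minimal sketch of an XMPP message stanza; server names and
# users are hypothetical. Each server knows the shared format,
# so it can route the message across the federation.
stanza = """
<message from="alice@server-a.example"
         to="bob@server-b.example"
         type="chat">
  <body>Hello from another server!</body>
</message>
"""
print(stanza)
```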
Okay. So, wrapping up a little bit. For a machine to make sense of data, it needs unambiguous content descriptors. I've described some modeling languages, like JSON Schema and RDF Schema, but consider the actual terms. Say you have a field, "temperature", in your data set. Is that about ambient weather outside? Is it internal body temperature? You need context; you can't just have the string "temperature". Humans can gather this context from other field names present in the file, or from a private communication, out of band: someone handed you a USB stick or sent you an email saying, "Hey, here's the data on this." Or maybe there's a README document that isn't explicitly linked to. Machines need more help with making this context explicit, with knowing: what do you mean by this temperature? What is it?
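Here's a minimal sketch of what that explicit context can look like in JSON-LD. The ontology IRI below is hypothetical, standing in for a real term that documents exactly which temperature is meant.

```python
# A minimal sketch of disambiguating a bare "temperature" field with a
# JSON-LD context; the term IRI is hypothetical.
import json

ambiguous = {"temperature": 37.0}  # body? ambient? Celsius? Fahrenheit?

explicit = {
    "@context": {
        # hypothetical IRI meaning "body temperature in degrees Celsius"
        "temperature": "https://ontology.example/terms/bodyTemperatureCelsius"
    },
    "temperature": 37.0,
}

# Now the key resolves to one globally unique, documented meaning.
print(json.dumps(explicit, indent=2))
```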
You need mechanized interpretation. And so a lot of these languages will help you do that. So, going back to the principle, let's see if we covered things.
So, metadata and data use a formal language. By formal, we mean machine-interpretable: it doesn't require outside context, and you can interpret it mechanistically.
Accessible. It's great if you apply the F principles so that all of your terms and vocabularies, and the model itself, are accessible, say, via HTTP URIs.
Shared. Other systems that are part of this federation of data exchange share how they expect messages to be formatted, so everyone can package things accordingly. You can format your address the way someone in Japan expects it, using their address system.
Broadly applicable. If the language isn't broadly applicable, it might be useful in your little domain, but if you want to do interdisciplinary work, it's going to be hard. So you really want something broadly applicable to a variety of situations, something you can specialize to your domain. That's why it's not enough to have a model: what you want is a modeling language. Then you can apply that language to create the domain-specific models, and they're all interoperable because they use the same language, as sketched below.
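Here's a minimal sketch of that idea using rdflib again. RDFS is the broadly applicable modeling language; the chemistry class and property are hypothetical domain-specific terms defined in it.

```python
# A minimal sketch of specializing a shared modeling language (RDFS)
# into a domain-specific model; the chemistry terms are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CHEM = Namespace("https://example.org/chemistry/")  # hypothetical

g = Graph()
# Domain-specific class and property, declared in the shared language:
g.add((CHEM.Spectrum, RDF.type, RDFS.Class))
g.add((CHEM.peakWavelength, RDF.type, RDF.Property))
g.add((CHEM.peakWavelength, RDFS.domain, CHEM.Spectrum))
g.add((CHEM.peakWavelength, RDFS.comment,
       Literal("Wavelength of the strongest peak, in nanometers")))

# Any RDFS-aware system can read this model, even outside chemistry.
print(g.serialize(format="turtle"))
```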
And again, for knowledge representation. This language is for knowledge representation: not just for representing what is, but how to interpret things. So, going back to my two points earlier, for each computer system to know the others' data exchange formats: (1) you need controlled term sets -- vocabularies, ontologies, thesauri, whatever you want to call them -- ideally with globally unique, persistent, resolvable identifiers.
And (2), apart from these controlled vocabularies, you need actual models, well-defined frameworks, to describe and structure the metadata according to those controlled vocabularies.
All right, that'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.