Machine-Centric Science | Transcript: F1: (Meta)data have globally unique, persistent identifiers

March 1, 2022 • 7 Minutes

F1: (Meta)data have globally unique, persistent identifiers

Today, we'll be talking about the first of the FAIR principles, F1: Metadata are assigned globally unique and persistent identifiers.

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we'll be talking about the first of the FAIR principles, F1: Metadata are assigned globally unique and persistent identifiers.

What does this mean? This means that every single concept in your metadata, every single concept in your data records-- column names, field names, whatever you want to call it, however your data is represented on disk-- needs to have an identifier that is globally unique and persistent.

This is the key to findability, because it makes sure that when people are talking about something, they're talking about the same thing. They can point to it. In a typical natural language dictionary, you might have a word. If you want to reference a particular definition, you need to reference the particular dictionary, maybe the edition, maybe the printing of that dictionary, whether it's online or printed.

And then even when you get to the actual word, there are multiple definitions. Are you referring to the noun sense? The verb? The adverb? And even if something's a noun, there might be multiple definitions. So how can you point to the actual term that you're using? This is the role of an identifier.

So why is it important that it be globally unique? Well, it's important because someone else shouldn't be able to accidentally use the same identifier as you to refer to a different thing. This is why it's super helpful to use HTTP URLs as your IDs: they're going to be globally unique because no one else can have the domain name that you're using for your ID. And if someone else were to use it anyway, a consumer of that other person's data might enter that URL in the browser and find your definition, not whatever the other person meant. So that's a very nice property of resolvable HTTP URLs: they ensure global uniqueness.

The other thing is persistence. So what does persistence mean? It means that it's going to continue to be there all the time. Now, persistence is a matter of service. It's not a matter of just naming something. It takes time and money to ensure persistence.

So what does this look like in practice? It looks like deferring to a service organization to host something: orcid.org to host an identifier representing a person, doi.org to host digital object identifiers, uniprot.org to host protein identifiers, that sort of thing. These entities are funded, and the identifiers should be resolvable. The other wrinkle here is that, because persistence is a matter of service, it might be the case that you can switch providers, and that's an important consideration.

I personally like the Archival Resource Key (ARK) identifier scheme because the naming authority is decoupled from the mapping authority. So you might be assigned a Name Assigning Authority Number (NAAN), and you can mint your own IDs under that namespace.

And separately from that, you can say, I want myprovider.org to be the thing that a meta-resolver points to, to resolve my IDs. And later on, I might want another organization. The way this manifests with meta-resolvers like identifiers.org or n2t.net (name-to-thing dot net) is that you'll have a prefix registered for that type of ID, and then it resolves to the service. So for example, you can go to identifiers.org slash DOI colon and then enter your DOI, and it will redirect your request to doi.org. You can also go to identifiers.org or n2t.net slash ORCID colon and then give the ORCID iD, and that will redirect to orcid.org, because orcid.org is currently the resolver that ensures persistence of service for resolving that identifier.

The global uniqueness is ensured as long as you have that registered namespace that no one else clobbers, and that ultimately is a URL. But what's nice about identifier systems like DOI, ORCID, and UniProt is that there are these common registered prefixes. So in case orcid.org, doi.org, or uniprot.org are down, or you want to have a local cache of identifiers, you can still unambiguously resolve them to their definitions without having to depend on the uptime of a canonical service. They're still globally unique IDs, and the persistence is guaranteed by the service.
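As an illustration of the prefix mechanism described here (not part of the episode), the following is a minimal Python sketch. It assumes the compact "prefix:local-id" form and the URL pattern "https://<meta-resolver>/<prefix>:<local-id>"; the function and variable names are hypothetical, and the table of canonical services only reflects the hosts named in the talk. It builds URLs only and makes no network requests.

```python
# Hypothetical helpers for illustration; these names come from no library.

# Meta-resolvers mentioned in the talk.
META_RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t.net": "https://n2t.net/",
}

# Services that currently resolve each prefix. This mapping can change over
# time; the compact identifier itself does not.
CANONICAL_SERVICES = {
    "doi": "https://doi.org/{id}",
    "orcid": "https://orcid.org/{id}",
}


def split_compact_id(compact_id):
    """Split 'prefix:local-id' at the first colon into (prefix, local_id)."""
    prefix, sep, local_id = compact_id.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return prefix.lower(), local_id


def meta_resolver_url(compact_id, resolver="identifiers.org"):
    """URL that asks a meta-resolver to redirect to the current service."""
    prefix, local_id = split_compact_id(compact_id)
    return f"{META_RESOLVERS[resolver]}{prefix}:{local_id}"


def canonical_url(compact_id):
    """URL at the service that currently hosts identifiers with this prefix."""
    prefix, local_id = split_compact_id(compact_id)
    return CANONICAL_SERVICES[prefix].format(id=local_id)


if __name__ == "__main__":
    # 0000-0002-1825-0097 is ORCID's example record for a fictitious person.
    print(meta_resolver_url("orcid:0000-0002-1825-0097", resolver="n2t.net"))
    print(canonical_url("orcid:0000-0002-1825-0097"))
```

The point of the split is that "orcid:0000-0002-1825-0097" stays the same string no matter which entry in either table is used to turn it into a URL.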

And again, this is very important for findability because this is how you can unify datasets. So rather than having to worry about foreign keys and what you can join on in order to produce an observation vector for machine learning, some join table, you can unify on these persistent IDs. So if one dataset refers to one of these globally unique persistent IDs and another dataset that's produced by someone else refers to the same ID, you can join on that key, whether the ID is the subject of one data statement or the object of another. You can merge data unambiguously based on that without having to know how to interpret some foreign key relation-- it's unambiguous for that reason. And so findability is one great thing and data integration is another great thing that globally unique and persistent identifiers enable. And there are lots of services that allow you to do this. You can also roll your own.
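To illustrate joining on a shared identifier (an editorial sketch, not from the episode), here is a small Python example. It assumes each dataset is a list of (subject, predicate, object) statements; the dataset contents, the example.org URL, and the function name are made up for illustration. The only real identifier is ORCID's example iD for a fictitious person.

```python
# Statements from two hypothetical producers. All values are placeholders.
PERSON = "https://orcid.org/0000-0002-1825-0097"  # ORCID's example record

lab_a = [
    # Here the persistent ID is the *subject* of the statement.
    (PERSON, "name", "Josiah Carberry"),
]

lab_b = [
    # Here the same persistent ID is the *object* of the statement.
    ("https://example.org/dataset/42", "creator", PERSON),
    ("https://example.org/dataset/42", "title", "Placeholder dataset"),
]


def statements_about(identifier, *datasets):
    """Collect every statement that mentions the identifier as subject or object."""
    return [
        (subject, predicate, obj)
        for dataset in datasets
        for (subject, predicate, obj) in dataset
        if identifier in (subject, obj)
    ]


if __name__ == "__main__":
    for statement in statements_about(PERSON, lab_a, lab_b):
        print(statement)
```

No foreign-key mapping between the two labs is needed: the merge works because both wrote down the same globally unique string.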

One thing I like about ARKs is that the scheme is decentralized, so you can, again, name as many things as you want, and then worry later about how to persist them and which service to dispatch to.
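A rough Python sketch of this "name now, choose a resolver later" idea follows (editorial illustration, not from the episode). It assumes the "ark:/NAAN/name" form; the NAAN 12345 is a placeholder, and the class name, the "x1" prefix on minted names, and the counter-based naming are hypothetical choices. Real deployments typically use a dedicated minting tool instead of a simple counter.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ArkMinter:
    """Mints ARK names locally under one NAAN; no resolver is involved."""

    naan: str             # Name Assigning Authority Number (placeholder below)
    shoulder: str = "x1"  # hypothetical sub-namespace prefix for minted names
    _counter: count = field(default_factory=lambda: count(1), repr=False)

    def mint(self) -> str:
        # Each call yields the next sequential name under this NAAN.
        return f"ark:/{self.naan}/{self.shoulder}{next(self._counter):06d}"


def resolver_url(ark: str, resolver_base: str) -> str:
    """Attach an ARK to whichever resolver is serving it at the moment."""
    if not ark.startswith("ark:"):
        raise ValueError(f"not an ARK: {ark!r}")
    return resolver_base.rstrip("/") + "/" + ark


if __name__ == "__main__":
    minter = ArkMinter(naan="12345")
    ark = minter.mint()
    # The same name can be dispatched to different services over time.
    print(resolver_url(ark, "https://n2t.net"))
    print(resolver_url(ark, "https://myprovider.org"))
```

Naming and resolving live in separate functions on purpose: switching providers changes only the second argument to resolver_url, never the minted ARK.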

In summary, F1 is super important for the FAIR principles. It is nice that it's the first one. It works out mnemonically that F is the first letter of the acronym FAIR, but it really is important that F be the first thing. Metadata and data with globally unique and persistent identifiers are the key to data unification, resource unification, integrative analysis, and people being able to know that they're making assertions about the same thing over time.

That'll be it for today. I'm Donny Winston and I hope you join me again next time for Machine-Centric Science.