Machine-Centric Science | Transcript: A2. Metadata are accessible, even when the data are no longer available

April 19, 2022 • 5 Minutes

A2. Metadata are accessible, even when the data are no longer available

Data may be, or become, inaccessible by design, or on request, or by accident. While it was accessible, it may have been used by others. If someone has a reference to data by ID, can they minimally understand the nature and provenance of the data?

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR Principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 8th of the 15 FAIR principles, A2: Metadata should be accessible even when the data is no longer available.

The thing is, data sets may degrade or disappear because of costs or legal reasons. Storing metadata is generally cheaper. Also, metadata has value even if the original data is missing. It's useful for planning research, particularly replication studies, but also just for tracking down people, institutions and publications that are associated with the original data.

Data may be, or become, inaccessible by design. There may be a defined lifespan given limited financial resources. There may be legal requirements to destroy data after a certain amount of time, or on request. Data also may become inaccessible by accident. But, in either case, the data, while it was accessible, may have been used by others.

If someone has a reference to data by ID, they need to minimally understand the nature and provenance of that referenced data, even if they can't access the data itself. This hearkens back to principle F3, where metadata records should include the identifier of the data they describe. Combine this with principle F4, where metadata are registered and indexed in a searchable resource: when you have the data set identifier associated with the metadata, as in F3, you can access historical metadata records even if the data is no longer accessible.
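To make the F3 plus F4 combination concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular repository works: the `MetadataRecord` and `MetadataIndex` names, the field names, and the identifier `example:dataset/42` are all made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MetadataRecord:
    """Hypothetical metadata record; field names are assumptions."""
    data_id: str          # F3: the record carries the identifier of the data
    title: str
    creator: str
    data_available: bool  # False once the data itself is gone


class MetadataIndex:
    """Toy stand-in for a searchable resource (F4), keyed by data identifier."""

    def __init__(self) -> None:
        self._by_data_id: Dict[str, MetadataRecord] = {}

    def register(self, record: MetadataRecord) -> None:
        self._by_data_id[record.data_id] = record

    def lookup(self, data_id: str) -> Optional[MetadataRecord]:
        # Returns None when no metadata was ever registered for this ID.
        return self._by_data_id.get(data_id)


index = MetadataIndex()
index.register(MetadataRecord(
    data_id="example:dataset/42",   # made-up identifier
    title="Example measurement series",
    creator="Example Lab",
    data_available=False,           # the data has been removed
))

# Someone holding only the data ID can still learn what it was and who made it.
record = index.lookup("example:dataset/42")
```

The point of the sketch is that the lookup goes through the metadata index, not the data store, so it keeps working after the data is deleted.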

So, hopefully you see the value of other people being able to minimally understand the nature and provenance of referenced data, even if they can't access the data. Some metadata has value for planning follow-on research, replication studies, tracking down people, institutions, and publications associated with the original data.

I also want to point out here that what can be helpful as metadata is not just descriptive metadata that describes the data object, but also policy and support-commitment metadata.

The Archival Resource Key (ARK) draft specification discusses several kinds of policy metadata, including so-called persistence statements regarding data objects. You can have metadata about object availability, that is, whether and how access to the object is supported, perhaps online 24/7, offline only, et cetera. There could also be statements about content invariance, that is, under what conditions the content of the object is subject to change, if at all. There could also be a change history, giving access to corrections, migrations, revisions, et cetera, either as links to changed objects or to a document summarizing the change history.

One approach to persistence statements is to define and offer a set of permanence levels, like that of the US National Institutes of Health's National Library of Medicine. They have permanence levels like Not Guaranteed, Permanent: Dynamic Content, Permanent: Stable Content, Permanent: Unchanging Content. You can read more about this in the ARK specification. I'll give a link in the show notes.
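As a rough sketch of what a persistence statement could look like as structured metadata, the Python below models the four permanence levels named above alongside availability and change history. The class and field names are assumptions for illustration and are not taken from the ARK specification. The `is_guaranteed` helper encodes one simple reading of the levels, namely that anything other than Not Guaranteed carries a permanence commitment.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PermanenceLevel(Enum):
    # Level names as listed in the episode.
    NOT_GUARANTEED = "Not Guaranteed"
    PERMANENT_DYNAMIC = "Permanent: Dynamic Content"
    PERMANENT_STABLE = "Permanent: Stable Content"
    PERMANENT_UNCHANGING = "Permanent: Unchanging Content"


@dataclass
class PersistenceStatement:
    """Hypothetical policy metadata attached to a data object."""
    availability: str                      # e.g. "online 24/7", "offline only"
    permanence: PermanenceLevel
    change_history: List[str] = field(default_factory=list)  # links or notes


def is_guaranteed(statement: PersistenceStatement) -> bool:
    # Assumption: every level except NOT_GUARANTEED is a permanence commitment.
    return statement.permanence is not PermanenceLevel.NOT_GUARANTEED


statement = PersistenceStatement(
    availability="online 24/7",
    permanence=PermanenceLevel.NOT_GUARANTEED,
)
```

A record like this can sit next to the descriptive metadata, so it stays readable even after the data object itself is gone.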

So, these persistence statements can be helpful as part of machine-actionable protocols. A machine can know that the data is not available or that it's not guaranteed to be available. So, if it can't access the data, it might choose to not retry later, or it might log a specific warning rather than an error. So, this kind of policy metadata can be particularly valuable when the data is unavailable or might be unavailable. It's helpful for a data consumer to know the service provider's policy about the data object and its accessibility, and to make that part of the metadata in addition to the descriptive metadata that actually describes the data object, its provenance, and its intrinsic properties.
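Here is a small Python sketch of that machine behavior, assuming the client has already tried and failed to fetch the data and has the permanence level from the metadata as a plain string. The function name, the logger name, and the shape of the returned plan are all hypothetical.

```python
import logging

logger = logging.getLogger("fair_client")  # hypothetical logger name


def plan_after_failed_fetch(data_id: str, permanence: str) -> dict:
    """Decide how to react to a failed data fetch, given policy metadata.

    `permanence` is the permanence level found in the metadata record,
    for example "Not Guaranteed" or "Permanent: Stable Content".
    """
    if permanence == "Not Guaranteed":
        # The provider never promised availability: warn and do not retry.
        logger.warning("Data %s is unavailable (not guaranteed).", data_id)
        return {"retry": False, "level": "warning"}
    # The provider committed to permanence: treat this as an error and retry.
    logger.error("Data %s should be available but could not be fetched.", data_id)
    return {"retry": True, "level": "error"}


plan = plan_after_failed_fetch("example:dataset/42", "Not Guaranteed")
```

With a permanence level of Not Guaranteed the plan is a warning with no retry; with any of the Permanent levels the same failure is treated as an error worth retrying.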

Okay, that'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.