R1: (Meta)data are richly described with a plurality of accurate and relevant attributes
Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR Principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the 12th of the 15 FAIR principles, R1: metadata and data are richly described with a plurality of accurate and relevant attributes.
If you've been following this series, R1 sounds a lot like F2. They're both about supplying rich metadata with a plurality of attributes. Before I get into the difference between F2 and R1, I want to talk a little bit about search.
At a high level, search works as follows. First, there's content understanding. This is about representing each piece of content in the search index.
After content understanding comes query understanding. This is representing each search query as a search intent.
Given content understanding and query understanding, you then have relevance.
The relevance of content is a function of query and content understanding. Relevance is how results get filtered. After relevance comes ranking. Ranking is how one orders the relevant, retrieved content by its desirability given the understanding of the query and the understanding of context. So, again, at a high level, search goes from content understanding to query understanding to relevance to ranking.
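To make those four stages concrete, here is a minimal sketch in Python. It is an illustration only, not how any real search engine works: the `Content` class, the `key=value` query syntax, the dataset identifiers, and the "newer is more desirable" score are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    """Content understanding: one indexed item and its descriptive attributes."""
    identifier: str
    attributes: dict = field(default_factory=dict)


def understand_query(raw_query):
    """Query understanding: parse 'key=value' terms into a search intent."""
    intent = {}
    for term in raw_query.split():
        key, _, value = term.partition("=")
        intent[key] = value
    return intent


def is_relevant(intent, content):
    """Relevance: a filter, true when every intended attribute matches."""
    return all(content.attributes.get(k) == v for k, v in intent.items())


def search(raw_query, index, desirability):
    """Filter the index by relevance, then rank by a desirability score."""
    intent = understand_query(raw_query)
    relevant = [c for c in index if is_relevant(intent, c)]
    ranked = sorted(relevant, key=desirability, reverse=True)
    return [c.identifier for c in ranked]


# Hypothetical index entries, invented for illustration.
index = [
    Content("ds-1", {"material": "steel", "technique": "xrd", "year": "2015"}),
    Content("ds-2", {"material": "steel", "technique": "xrd", "year": "2021"}),
    Content("ds-3", {"material": "copper", "technique": "xrd", "year": "2020"}),
]

# Assumed desirability: newer datasets rank higher.
results = search("material=steel technique=xrd", index,
                 desirability=lambda c: int(c.attributes["year"]))
print(results)  # ['ds-2', 'ds-1']
```

Relevance only decides what stays in the result set; ranking is a separate step, so the desirability function can be swapped without touching the filter.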
The F2 effort in FAIR, richly describing data and metadata with a plurality of attributes, is a force multiplier for content understanding.
The R1 effort, the topic of this podcast, is, rather, a force multiplier for query understanding. The R1 focus is to help a user, machine or human, determine if data is useful in a particular context.
Put another way, the F2 effort is about metadata for findability and discovery of the content.
R1 is about metadata for reuse, for context transfer based on understanding of the intent of the query.
As mentioned in my F2 podcast, the use of the terms "richly" and "plurality" is about being generous. This is about including even what may seem irrelevant now.
Another way of looking at this is to distinguish between framing and validation. With a system like JSON-LD, you have this idea of framing. It's similar with the Shapes Constraint Language (SHACL), where you have node shapes and property shapes, meaning you validate what you need when you read it -- it's about schema-on-read. That's when you validate; that's when you frame what you want.
And with framing comes reframing. Something that isn't relevant or valid in a given frame might, once you reframe it, be useful and valid. So again, this is known as schema-on-read, versus something like schema-on-write.
With schema-on-write, when you're anticipating many potential applications, it's not necessarily about validating for any one of them. Schema-on-write is about ensuring consistency with the domain space. That's why you can include lots of metadata attributes that are consistent with the domain space, even though they don't by themselves mean the data is valid in a particular context or application. It's still useful to include those.
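Here is a rough sketch of framing and reframing at read time, in plain Python. It is only a stand-in for the idea: it does not use the actual JSON-LD framing algorithm or SHACL shapes, and the record, the attribute names, and the two frames are hypothetical.

```python
# A generously described record. Every attribute here is made up.
record = {
    "title": "Tensile tests on steel coupons",
    "temperature_K": 293,
    "strain_rate_per_s": 0.001,
    "operator": "lab-tech-7",
}

# A frame lists the attributes one application needs, with a check for each.
mechanics_frame = {
    "temperature_K": lambda v: isinstance(v, (int, float)),
    "strain_rate_per_s": lambda v: isinstance(v, float) and v > 0,
}
provenance_frame = {
    "operator": lambda v: isinstance(v, str),
    "instrument": lambda v: isinstance(v, str),
}


def read_with_frame(record, frame):
    """Schema-on-read: return (is_valid, view) for one frame.

    `view` keeps only the framed attributes that the record actually has.
    """
    view = {k: record[k] for k in frame if k in record}
    is_valid = all(k in record and check(record[k]) for k, check in frame.items())
    return is_valid, view


ok_mech, mech_view = read_with_frame(record, mechanics_frame)
ok_prov, prov_view = read_with_frame(record, provenance_frame)
print(ok_mech, mech_view)  # True {'temperature_K': 293, 'strain_rate_per_s': 0.001}
print(ok_prov, prov_view)  # False {'operator': 'lab-tech-7'}
```

The same record is valid under one frame and not under the other, and the "operator" attribute that the first frame ignores is exactly what the second frame asks for. That is the case for being generous when the record is written.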
R1 metadata can help assess and filter resources indexed by F2 metadata based on suitability for purpose.
The R1 metadata may include operational instructions for reuse, that is, determining, via computation perhaps, whether a retrieved resource, an accessed resource, is suitable for inclusion in a particular analysis, how to process it, et cetera.
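As a minimal sketch of that kind of computed suitability check, assume each candidate found via F2 metadata carries a small block of reuse metadata. The `reuse` field, its keys, the allowed licenses, and the dataset identifiers are all hypothetical, chosen only for illustration.

```python
# Candidates already found via discovery metadata; 'reuse' is a made-up field.
candidates = [
    {"id": "ds-1", "reuse": {"license": "CC-BY-4.0", "units": "MPa", "calibrated": True}},
    {"id": "ds-2", "reuse": {"license": "proprietary", "units": "MPa", "calibrated": True}},
    {"id": "ds-3", "reuse": {"license": "CC0-1.0", "units": "psi", "calibrated": True}},
    {"id": "ds-4", "reuse": {"license": "CC-BY-4.0", "units": "MPa", "calibrated": False}},
]

# Requirements of one particular analysis (assumptions for this example).
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}
TO_MPA = {"MPa": 1.0, "psi": 0.00689476}  # conversion factors to megapascals


def assess(candidate):
    """Return a processing plan if the candidate suits this analysis, else None."""
    reuse = candidate["reuse"]
    if reuse["license"] not in ALLOWED_LICENSES or not reuse["calibrated"]:
        return None
    if reuse["units"] not in TO_MPA:
        return None
    # Suitable: also say how to process it (here, a unit conversion factor).
    return {"id": candidate["id"], "scale_to_MPa": TO_MPA[reuse["units"]]}


plans = [plan for plan in map(assess, candidates) if plan is not None]
print([plan["id"] for plan in plans])  # ['ds-1', 'ds-3']
```

The check answers two questions at once: whether a resource belongs in the analysis, and what has to be done to it first.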
You might encounter this principle in other contexts, say in web development, by the name progressive enhancement, or hydration.
One example of the value of rich annotation with plurality in the website-building world is the so-called "islands architecture" for web development. In the islands architecture, some, perhaps most, interactive widget functionality is irrelevant to a given page view. Only the JavaScript that is needed is delivered to the client, and even then perhaps it's hydrated just-in-time. So, there's this idea that you'll have several islands on the page, content islands, and some of them may benefit from progressive interactivity. An island may arrive as static content, and maybe once that component becomes visible, some JavaScript gets enabled and all of a sudden you have this interactive widget. So essentially you're able to deliver this dehydrated metadata package to render that part of the page, and that might get hydrated with additional metadata and functionality to make it interactive.
Similarly, with F2 metadata, you might find or discover a resource through a metadata record that's dehydrated, but good enough to determine if the resource is relevant to your study or not. That metadata record may get hydrated with additional R1 metadata so that you know if the resource is appropriate for your use, and then you might hydrate it further to determine how to process it for a particular analysis. And maybe at that point you might actually access it -- the full data. So you might have this progressive enhancement of metadata in order to determine whether something is relevant and how desirable it is, that is, how to rank it.
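A small sketch of that progressive hydration, under stated assumptions: the three layer names ("discovery", "reuse", "processing"), the in-memory `FAKE_STORE`, and the `MetadataRecord` class are all invented for this example and stand in for whatever a real repository would serve.

```python
# Hypothetical metadata store, keyed by (identifier, layer name).
FAKE_STORE = {
    ("ds-1", "discovery"): {"keywords": ["steel", "tensile"]},
    ("ds-1", "reuse"): {"license": "CC-BY-4.0"},
    ("ds-1", "processing"): {"format": "csv", "delimiter": ";"},
    ("ds-2", "discovery"): {"keywords": ["copper", "corrosion"]},
    ("ds-2", "reuse"): {"license": "CC-BY-4.0"},
    ("ds-2", "processing"): {"format": "csv", "delimiter": ","},
}


class MetadataRecord:
    """A record that starts dehydrated and fetches each layer only on demand."""

    def __init__(self, identifier, store):
        self.identifier = identifier
        self._store = store
        self._layers = {}
        self.fetch_log = []  # which layers were actually hydrated, in order

    def layer(self, name):
        if name not in self._layers:
            self.fetch_log.append(name)
            self._layers[name] = self._store[(self.identifier, name)]
        return self._layers[name]


def evaluate(record, keyword, allowed_licenses):
    """Hydrate step by step; stop as soon as the record stops being useful."""
    if keyword not in record.layer("discovery")["keywords"]:
        return None  # not relevant: never touch the reuse layer
    if record.layer("reuse")["license"] not in allowed_licenses:
        return None  # relevant but not appropriate for this use
    return record.layer("processing")  # how to process it for the analysis


steel = MetadataRecord("ds-1", FAKE_STORE)
copper = MetadataRecord("ds-2", FAKE_STORE)
plan = evaluate(steel, "steel", {"CC-BY-4.0"})
skipped = evaluate(copper, "steel", {"CC-BY-4.0"})
print(plan, steel.fetch_log)      # {'format': 'csv', 'delimiter': ';'} ['discovery', 'reuse', 'processing']
print(skipped, copper.fetch_log)  # None ['discovery']
```

The second record never gets past its discovery layer, which is the point: you only pay for the hydration you need.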
Another technology that relates to this that I'd like to bring to your attention is Hydra, a W3C Community Group effort for hypermedia-driven web APIs. This is essentially about including affordances in metadata that help you know what to do next and how to reuse something, how to progressively enhance metadata. I feel like the principles behind Hydra parallel the relation between the R1 effort at metadata and the F2 effort at metadata: you might index your content and understand your content via F2 metadata, but there should be affordances for navigating your metadata and getting to the R1 metadata in order to determine, once you've found relevant resources, how to rank them by desirability.
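To give a feel for affordances, here is a toy sketch in which each metadata response says what you can do next. It deliberately does not use the actual Hydra vocabulary or any HTTP client: the `affordances` key, the relation names, and the paths are hypothetical, and the "API" is just a dictionary.

```python
# A pretend API: each resource advertises its own next steps.
RESOURCES = {
    "/datasets/ds-1": {
        "title": "Tensile tests on steel coupons",
        "affordances": {"reuse-metadata": "/datasets/ds-1/reuse"},
    },
    "/datasets/ds-1/reuse": {
        "license": "CC-BY-4.0",
        "affordances": {"download": "/datasets/ds-1/data.csv"},
    },
    "/datasets/ds-1/data.csv": {"affordances": {}},
}


def follow(start, relations):
    """Follow the named affordances in order; return the paths visited."""
    path = start
    visited = [path]
    for relation in relations:
        # The client knows relation names, not URLs; the metadata supplies the URL.
        path = RESOURCES[path]["affordances"][relation]
        visited.append(path)
    return visited


visited = follow("/datasets/ds-1", ["reuse-metadata", "download"])
print(visited)
# ['/datasets/ds-1', '/datasets/ds-1/reuse', '/datasets/ds-1/data.csv']
```

The client never hard-codes where the reuse metadata lives. It starts from the discovery record and follows what that record offers.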
And again, bringing this back to the search process: thank you, Daniel Tunkelang, for writing extensively on content understanding and query understanding, and for supplying this high-level search framework.
Again, metadata is about force-multiplying. In a search process, the F2 effort is a force multiplier for content understanding, and the R1 effort is a force multiplier for query understanding. And with any kind of finding or discovery of data, and any kind of reuse or context transfer, you're going to need to combine the content understanding with the query understanding to determine relevance and ranking, and ultimately to reuse the data and repurpose it in a new context.
That'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.