R1.3: metadata and data meet domain-relevant community standards
Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we're going to be talking about the last of the 15 FAIR principles, R1.3: metadata and data meet domain-relevant community standards.
Where clear and obvious community standards exist, they should be followed. Datasets are easier to reuse if they use standards that are well-established, particularly in a given domain. A first approach is to ask around; ask people with whom you coauthor, people you trust in your field, et cetera.
Also, many domains have so-called "minimal information" standards. These might be a good place to start.
If you are diverging from a so-called "obvious" standard, document why and how in your metadata. This can increase trust because a search for data using the vocabulary of the recognized standard can then surface your data via your qualified references -- see principle I3 -- to the vocabulary terms of the recognized standard.
Where can you find these standards and evaluate their relevance? There are a few repositories of standards. I'll mention a couple that I like. One is the Linked Open Vocabularies repository, hosted by the Polytechnic University of Madrid -- I'll leave a link in the show notes. Another is FAIRsharing, based at the University of Oxford. Linked Open Vocabularies has approximately 800 vocabularies. FAIRsharing has approximately the same number of terminology artifacts. It has about 1600 standards in total, going from terminology artifacts, to models and formats, to reporting guidelines, to identifier schemas.
Okay, so you've found a repository of standards, or you're looking around. How do you actually determine domain relevance? How do you sort out the standards that are relevant, and how do you rank them so that you know which one to use?
For the remainder of this episode, I'll go over some fundamentals of search in information retrieval, specifically relevance and ranking. So hopefully this can give you some ideas about how to evaluate these standards for relevance in your domain.
Relevance and ranking are related concepts in information retrieval. Relevance can be seen as ranking all of the content with a score of one or zero. In relevance, there's a trade-off between precision, that is, the fraction of the retrieved results that are relevant, and recall, the fraction of relevant results that are retrieved.
Often when you're dealing with things like Google search or e-commerce, precision is very important because people will just look at maybe the top 10 results, or maybe just the first few, and then they'll go away. However, with discovery, like what you might do to discover domain-relevant community standards for your data, it's okay to take more than a second, so recall is most important for relevance. You want to make sure that you include all of the potentially relevant results, and then you want to rank them or score them.
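To make the precision and recall definitions concrete, here is a minimal sketch in Python. The standard names (`std-a` and so on) and the choice of which ones count as relevant are hypothetical, made up purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = fraction of retrieved results that are relevant
    recall    = fraction of relevant results that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the standards that really are relevant.
relevant = {"std-a", "std-b", "std-c", "std-d"}

# A narrow search: everything returned is relevant, but half is missed.
narrow = ["std-a", "std-b"]

# A broad, discovery-style search: nothing is missed, but half is noise.
broad = ["std-a", "std-b", "std-c", "std-d",
         "std-w", "std-x", "std-y", "std-z"]

narrow_pr = precision_recall(narrow, relevant)  # (1.0, 0.5)
broad_pr = precision_recall(broad, relevant)    # (0.5, 1.0)
```

The broad search is the discovery-style trade: it accepts lower precision in exchange for full recall, and leaves it to a later ranking step to push the noise down the list.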
There are different ways to collect judgments for relevance. One is explicit human judgments, manually assigning labels. So you might have a repository that has human curation of the various standards, labeling them good or bad or not so good. You can also collect implicit behavioral judgments. These are often low-cost, where the judges are end-users, but you might have presentation bias, and you might conflate relevance with desirability. So, with e-commerce search, these are things like click-through rate. And so the implicit behavioral judgments for community standards would involve them being used by other datasets -- what other datasets are they being used by? So, apart from someone explicitly labeling them as good or bad, you can just infer that a standard is useful because, based on people's behavior, it seems to be used a lot.
There are two main classes of signals for ranking search results. One class is query-independent signals, which are independent of the actual thing that you're looking for. These are things like popularity, or quality or authority as measured by a PageRank algorithm, or inbound links, or external resources, or a separate machine learning system. You can see my show notes for a demo using Python of implementing PageRank on the Linked Open Vocabularies set to essentially see which of the vocabularies there are the most reputable in the PageRank sense.
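As a rough illustration of PageRank as a query-independent signal, here is a small power-iteration sketch in Python. It is not the demo from the show notes. The vocabulary names and the "reuses terms from" links are hypothetical, and the damping factor of 0.85 is just the conventional default.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    links maps each node to the list of nodes it points to.
    Nodes with no outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: an edge A -> B means "vocabulary A reuses terms from B".
reuses = {
    "vocab-a": ["core"],
    "vocab-b": ["core", "vocab-a"],
    "vocab-c": ["core"],
    "core": [],
}

scores = pagerank(reuses)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, `core` comes out on top because every other vocabulary reuses it, and `vocab-a` outranks `vocab-c` because it has one extra inbound link.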
Other query-independent signals include collection size -- so, perhaps anomalous sizes, vocabularies that are too big or too small, may be negative signals. Another query-independent signal is recency. You might be more interested in a newly updated vocabulary. And also there's global engagement data, as I mentioned -- for example, usage in databases.
Then, of course, there are query-dependent signals, concerning the match between what you're looking for and what's out there. Some query-dependent signals include the number of matching tokens. So, if you have a search query that has five words, how many of them match for a given vocabulary? Perhaps the number of tokens matching a specific field, like the actual label of a term versus the description. You might consider synonyms. This is particularly important if your field is new or cutting-edge, or if there isn't a lot of stuff out there that uses your preferred terminology but there is stuff that uses related terminology.
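Here is one way those query-dependent signals might look in Python: counting matching tokens per field, weighting the label above the description, and expanding the query with synonyms. The field names, the weights, the synonym table, and the two example terms are all assumptions made for this sketch.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(query, term, synonyms, field_weights):
    """Sum, over fields, of weight * number of query tokens matched.

    A query token counts as matched in a field if the token itself
    or any of its synonyms appears in that field.
    """
    groups = [{tok} | set(synonyms.get(tok, ())) for tok in tokenize(query)]
    score = 0.0
    for field, weight in field_weights.items():
        field_tokens = set(tokenize(term.get(field, "")))
        matched = sum(1 for group in groups if group & field_tokens)
        score += weight * matched
    return score


# Hypothetical synonym table and field weights (label counts double).
SYNONYMS = {"specimen": ["sample"]}
FIELD_WEIGHTS = {"label": 2.0, "description": 1.0}

term_a = {"label": "sample temperature",
          "description": "Temperature of the sample at collection."}
term_b = {"label": "storage site",
          "description": "Where the sample is kept; temperature is not recorded."}

query = "specimen temperature"
score_a = match_score(query, term_a, SYNONYMS, FIELD_WEIGHTS)   # 6.0
score_b = match_score(query, term_b, SYNONYMS, FIELD_WEIGHTS)   # 2.0
score_a_no_syn = match_score(query, term_a, {}, FIELD_WEIGHTS)  # 3.0
```

Without the synonym table, "specimen" never matches "sample", and the score for the first term drops by half. That is the situation described above, where your preferred terminology differs from what the existing standards use.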
Some other query-dependent signals include token weighting. I won't get too much into these, but you can look into terms like "tf-idf", that is, term frequency times inverse document frequency. This is how search engines that are keyword-based can evaluate a query. There's also cosine similarity using an embedding. So, you might embed a representation of your query and your content and use cosine similarity.
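A minimal sketch of tf-idf weighting plus cosine similarity, in plain Python. Two assumptions to flag: the idf formula used here, log(N / df), is only one of several common variants, and the sparse tf-idf vectors stand in for a learned embedding, which would need a separate model. The three "documents" are hypothetical descriptions of standards.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists. Returns (sparse vectors, idf table)."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n / count) for token, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical one-line descriptions of three standards.
docs = [
    "soil sample depth and soil moisture".split(),
    "patient sample age and diagnosis".split(),
    "telescope exposure time and filter".split(),
]
vectors, idf = tfidf_vectors(docs)

# Weight the query with the same idf table; unseen tokens get weight 0.
query = "soil moisture sample".split()
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

similarities = [cosine(query_vec, vec) for vec in vectors]
best = max(range(len(docs)), key=similarities.__getitem__)
```

The word "and" appears in every document, so its idf is zero and it contributes nothing. "sample" appears in two of three, so it counts for less than "soil" or "moisture". That down-weighting of common tokens is the point of the idf factor.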
One way you might use query-dependent and query-independent signals in practice is you might have a bespoke vocabulary or standard that you're using. And you might represent that as a query against a corpus of standards. And what essentially you'd be looking for is some similarity between your terminology and what's out there, and you'd want to find a good standard that has a high query-dependent signal, so high relevance with respect to your actual content, but also a high query-independent signal -- something that seems to be highly used independently of your particular usage.
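One simple way to blend the two kinds of signal is a weighted sum, sketched below in Python. The 0.7 weight, the log scaling of usage counts, and the candidate names and numbers are all assumptions for illustration, not a recommended formula.

```python
import math


def combined_score(relevance, usage_count, max_usage, weight=0.7):
    """Blend a query-dependent relevance in [0, 1] with a usage signal.

    Usage is log-scaled so one heavily used standard does not drown out
    everything else, then normalized to [0, 1] by the maximum usage.
    """
    popularity = math.log1p(usage_count) / math.log1p(max_usage) if max_usage else 0.0
    return weight * relevance + (1.0 - weight) * popularity


# Hypothetical candidates: relevance to your vocabulary, and how many
# datasets are known to use each standard.
candidates = {
    "std-niche": {"relevance": 0.90, "usage": 3},
    "std-popular": {"relevance": 0.80, "usage": 900},
    "std-offtopic": {"relevance": 0.10, "usage": 1000},
}

max_usage = max(c["usage"] for c in candidates.values())
scores = {
    name: combined_score(c["relevance"], c["usage"], max_usage)
    for name, c in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these numbers the widely used, nearly-as-relevant standard ranks first, the slightly more relevant but rarely used one second, and the popular but off-topic one last.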
Keep in mind also that you can employ multi-phase ranking, meaning cheap scoring first, cascading into more expensive scoring later on a smaller set of candidates. This is also known as funneling.
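A minimal sketch of a two-phase funnel in Python: a cheap score over the whole corpus picks a shortlist, and a costlier score reranks only that shortlist. Token overlap and Jaccard similarity are stand-ins here; in practice the second phase might be an embedding model or a manual review. The corpus entries are hypothetical.

```python
def overlap(query, text):
    """Cheap first-phase score: number of shared tokens."""
    return len(set(query.split()) & set(text.split()))


expensive_calls = []  # records every second-phase call, to show the savings


def jaccard(query, text):
    """Stand-in for an expensive second-phase score."""
    expensive_calls.append(text)
    q, d = set(query.split()), set(text.split())
    return len(q & d) / len(q | d) if q | d else 0.0


def funnel_rank(query, corpus, cheap, expensive, keep):
    """Score everything cheaply, keep the top `keep`, then rerank those."""
    shortlist = sorted(corpus, key=lambda name: cheap(query, corpus[name]),
                       reverse=True)[:keep]
    return sorted(shortlist, key=lambda name: expensive(query, corpus[name]),
                  reverse=True)


# Hypothetical corpus: standard name -> the terms it defines.
corpus = {
    "std-a": "soil sample depth moisture ph texture colour horizon",
    "std-b": "soil moisture",
    "std-c": "telescope exposure filter",
    "std-d": "patient sample age",
}

result = funnel_rank("soil moisture sample", corpus, overlap, jaccard, keep=2)
```

The cheap phase prefers `std-a` because it shares the most tokens, but the second phase puts `std-b` first because it is a tighter match. The expensive function runs only twice, not once per corpus entry.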
Finally, on identifying trustworthy sources of metadata on which you can base ongoing search for domain-relevant community standards, I recommend you see the Principles of Open Scholarly Infrastructure, POSI. I will drop a link to those in the show notes. And hopefully that can help you in keeping abreast of emergent domain-relevant community standards for your data publication.
That'll be it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.