Machine-Centric Science | Transcript: F4: (Meta)data are registered or indexed in a searchable resource

March 22, 2022 • 7 Minutes

F4: (Meta)data are registered or indexed in a searchable resource

Increasing leverage: the ratio of machine action to user action. Indexing as leverage via sorting.

Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today, we'll be talking about the fourth principle of the FAIR principles: F4: metadata are registered or indexed in a searchable resource. Metadata or data.

The key here is leverage, leverage in the sense of mechanical advantage. The ratio of resource action to user action. I know that technically leverage is a ratio of forces in physics, but bear with me here. Action has dimensions of energy times time. Energy as work is force times distance. There's not really any distance involved here. And the user time invested, either as input or waiting for a search result, is the same as machine time invested either idling or processing. So if we kind of fuzz away the distance and time aspects, we still have this proportionality of force to action. So, let's say that the leverage we're talking about here is the ratio of resource action to user action.

So, how much work does the user have to do over time, and how is that amplified by your resource to help people find things more quickly and not have to invest so much action?

Let's make this a bit more concrete. You could have someone analyzing data that you share with them, or someone who encounters your paper. They could do a full data download, and maybe it's a large table and they kind of do a full table scan. If your data is on a webpage, they could scroll through a long table on that webpage, looking at each row. So, that's a lot of action.

That's a lot of time spent by the user, versus something that would require a lot less action to get them to, say, the row in the table that's of interest to them. Or the document in a web of documents, a bunch of webpages, that's interesting to them.

So, examples of a small amount of action might be just typing in a search phrase. Or selecting a value from a dropdown menu, or following a drill-down hierarchy of links. You might have a webpage that has a bunch of categories -- they click on one, they click on another one, and they eventually get to a subset of your data. Another thing would be submitting a structured query in some sort of structured query language or using an API.

And so they can get their result by having the machine do more of the work than they do. That's leverage, and the essence of this kind of leverage, you'll often find, is indexing.

The essence of indexing is sorting. So, if you have a sorted list of things, then a machine can use, say, binary search. Given the thing you want, it can go right to the halfway point and say: is it here? No. Is the thing that I'm looking at greater than or less than the thing that I want? And depending on which way it is, it can eliminate half of the space and recurse, and so you'll end up getting to your record in log-of-the-number-of-elements-of-your-set time, versus time proportional to the total number of elements if you were to do a full scan through it. So indexing enables leverage in the act of finding things.
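To put rough numbers on that, here is a minimal Python sketch that counts comparisons for a full scan and for a binary search over the same sorted list. The function names and the `sample-000042` style identifiers are made up for illustration; they are not from any particular resource.

```python
def linear_find(records, target):
    """Full scan: check every record in turn.

    Returns (position or -1, number of records inspected).
    """
    steps = 0
    for i, value in enumerate(records):
        steps += 1
        if value == target:
            return i, steps
    return -1, steps


def binary_find(sorted_records, target):
    """Binary search over an already-sorted list.

    Each pass inspects the midpoint and discards half of the
    remaining range. Returns (position or -1, number of midpoints inspected).
    """
    steps = 0
    lo, hi = 0, len(sorted_records)
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_records[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid       # target is at mid or in the lower half
    found = lo < len(sorted_records) and sorted_records[lo] == target
    return (lo if found else -1), steps


if __name__ == "__main__":
    # Hypothetical identifiers; zero-padding keeps them in sorted order.
    records = [f"sample-{i:06d}" for i in range(100_000)]
    print(linear_find(records, "sample-099999"))
    print(binary_find(records, "sample-099999"))
```

For a hundred thousand records, the scan inspects up to a hundred thousand entries to reach the last one, while the binary search inspects at most seventeen. That gap is the leverage.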

And there are various ways of indexing. A text-based search engine, like Google, or something like Elasticsearch, will index the keywords of documents. And so you can type in some keywords and probably narrow down the space of documents to the ones that include those keywords.
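As a toy illustration of keyword indexing, here is a sketch of an inverted index in Python: a mapping from each word to the set of documents that contain it. It is a simplification under stated assumptions, not how Google or Elasticsearch work internally. Tokenizing is plain whitespace splitting, and the document IDs and texts are invented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical dataset descriptions.
    docs = {
        "ds1": "silicon band gap measurements",
        "ds2": "germanium band structure calculations",
        "ds3": "silicon thermal conductivity",
    }
    index = build_inverted_index(docs)
    print(search(index, "silicon"))
    print(search(index, "band gap"))
```

The user types a couple of words; the machine does set lookups and intersections instead of reading every document.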

Indexing of fields of tables or records can be great, and that's especially powerful if you have what are known as covering indexes. This is in the sense that if, say, you have a table of three fields, subject, predicate, and object -- so you can do this with triple stores -- you can index and sort by all arrangements: subject predicate object, predicate subject object, et cetera. And this way, what you find is always in your index. So you're never doing a scan. Now, you don't always need such covering indexes for table-based systems if your leverage comes from co-locating data in a denormalized or multiple-copies way as fields in the rows of tables.

So what this means is you might index on a couple of fields. And then the other data that you want to include is included in that row. And so when you're retrieving a whole row at a time, you kind of don't need to further index that stuff.

This approach requires a bit more planning when you're using relational data, so you kind of need to think about your use case and decide which fields need to be indexed, and which other fields that don't need to be indexed should be included in your rows.
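Here is a rough sketch of the triple-store idea in Python: keep one sorted copy of the triples for each ordering of subject, predicate, and object, and answer a pattern by binary-searching the copy whose leading fields are the ones you know. The `TripleIndex` class and the example triples are hypothetical. Production triple stores often keep fewer orderings and use on-disk structures, so treat this only as an illustration.

```python
from bisect import bisect_left
from itertools import permutations


class TripleIndex:
    """Sorted copies of (subject, predicate, object) in all six orderings."""

    def __init__(self, triples):
        self.indexes = {}
        for order in permutations(range(3)):
            # Rearrange each triple into this ordering, then sort.
            self.indexes[order] = sorted(
                tuple(t[i] for i in order) for t in triples
            )

    def match(self, s=None, p=None, o=None):
        """Return triples matching the given fields; None means 'any'."""
        pattern = (s, p, o)
        bound = [i for i in range(3) if pattern[i] is not None]
        free = [i for i in range(3) if pattern[i] is None]
        order = tuple(bound + free)      # known fields come first
        index = self.indexes[order]
        prefix = tuple(pattern[i] for i in bound)
        # Binary search to the first row that starts with the prefix.
        k = bisect_left(index, prefix)
        results = []
        while k < len(index) and index[k][:len(prefix)] == prefix:
            row = index[k]
            triple = [None, None, None]
            for pos, field in enumerate(order):
                triple[field] = row[pos]  # restore (s, p, o) order
            results.append(tuple(triple))
            k += 1
        return results


if __name__ == "__main__":
    idx = TripleIndex([
        ("sampleA", "hasMaterial", "silicon"),
        ("sampleB", "hasMaterial", "germanium"),
        ("sampleA", "measuredAt", "300K"),
    ])
    print(idx.match(s="sampleA"))
    print(idx.match(p="hasMaterial", o="silicon"))
```

Whatever combination of fields is known, some ordering has those fields as its sorted prefix, so the lookup never falls back to a scan. The cost is storing the data several times over.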
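A small sketch of that planning, using Python's built-in `sqlite3` module. The table, its columns, and the index name are all invented for illustration. The index covers only the two fields assumed to be searched on (material and temperature), while the instrument, operator, and value simply ride along in each row.

```python
import sqlite3

QUERY = (
    "SELECT * FROM measurements "
    "WHERE material = ? AND temperature_k = ?"
)


def build_db():
    """Create an in-memory table with an index on the two searched fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements ("
        " sample_id TEXT, material TEXT, temperature_k REAL,"
        " instrument TEXT, operator TEXT, value REAL)"
    )
    # Index only what people are expected to search by.
    conn.execute(
        "CREATE INDEX idx_material_temp "
        "ON measurements (material, temperature_k)"
    )
    rows = [
        ("s1", "silicon", 300.0, "XRD-1", "alice", 1.5),
        ("s2", "silicon", 77.0, "XRD-1", "bob", 2.5),
        ("s3", "germanium", 300.0, "XRD-2", "alice", 3.5),
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def find(conn, material, temperature_k):
    """Fetch whole rows; the unindexed fields come along with each row."""
    return conn.execute(QUERY, (material, temperature_k)).fetchall()


def query_plan(conn):
    """Ask SQLite how it would run the query (index search or full scan)."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("silicon", 300.0))
    return " ".join(row[3] for row in plan.fetchall())


if __name__ == "__main__":
    conn = build_db()
    print(find(conn, "silicon", 300.0))
    print(query_plan(conn))
```

The query plan should report a search using `idx_material_temp` rather than a scan of the table. If users later start filtering on `operator`, that is the signal to revisit which fields are indexed.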

Again, the end goal here is leverage: increasing the ratio of machine action to user action in getting to the data that they want. Otherwise, your data is technically findable, it's just going to require a lot of user action. They might have to do, again, a full data download, scan through a full table, scroll through a long thing, and it's unlikely that they're going to actually find what they need, because they're just not going to put in that much effort. So you really want indexing. You want this leverage to have your machine help do some of the action that a user might otherwise do.

And there are resources out there like Google for documents, Google Dataset Search, and you can also build your own, or use some community resources that are domain-specific. But the idea of having it in some sort of searchable resource, where your metadata is registered and indexed, really, really helps with findability. It enables people to know what they want, sort of, and not expend that much effort and time on finding the thing that they were told is in your data set, but they're not quite sure where it is.

Okay. That'll be it for today. I'm Donny Winston and I hope you join me again next time for Machine-Centric Science.