Vineeth Venugopal

Vineeth is a materials scientist working on creating a knowledge graph of materials. He is new to ontologies and the semantic web in general; he'd like to understand ontologies/taxonomies and what an ontologist/taxonomist does in general. I've agreed to let him barrage me with questions until hopefully some clarity is reached.

[00:00:00] Hello and welcome to Machine Centric Science. My name is Donny Winston, and I'm here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today we're joined by special guest Vineeth Venugopal.

Vineeth is a materials scientist at MIT in Cambridge, Massachusetts, United States, working on creating a knowledge graph of materials. He is new to ontologies and the semantic web in general; he'd like to understand ontologies and taxonomies and what an ontologist or taxonomist does. I've agreed to let him barrage me with questions until hopefully some clarity is reached.

Welcome, Vineeth.

Thank you, Donny. Thanks for inviting me.

Great. Vineeth, could you please first introduce yourself a little to our listeners, and also tell us about the context you're bringing: your interest in creating a knowledge graph of materials?

Yeah, absolutely. So I'm a material scientist, as you said. My undergrad was in ceramic engineering, and then my PhD was [00:01:00] on piezoelectrics. So I used to work in a lab, fabricating materials, characterizing them, testing them. Towards the end of my PhD, my interest branched toward artificial intelligence and materials science, and one of the questions that I've always been very interested in is how data is organized in the field.

Because as you know very well, the main drawback in materials science, and especially the big challenge in applying artificial intelligence, is the absence of databases, specifically machine-readable databases. That's always been a challenge for the field. What we do here in our lab at MIT is apply natural language processing to automatically extract data from large quantities of text.

Specifically, parsing-based methods applied to text from scientific publishers. [00:02:00] Something that has always been very surprising to me is that there isn't a single one-stop database of experimental materials science information anywhere. So if I ask, can you give me the list of all materials that are piezoelectric, or what are the properties of lithium cobalt titanate, there is usually no one place where you can get this information. You either need to go to a textbook, or most likely you will end up talking to a specialist in the field who has worked on the topic for many years. And this is especially glaring because today I can Google almost anything and have Google tell me the answer right away.

It's surprising to me that despite hundreds of years of experimental background, with people working on it and people understanding how important the field is, we have not reached a point where all of this information is aggregated into [00:03:00] one easily accessible place where you can run highly nuanced, focused queries.

So that was my interest in developing a knowledge graph. Since Google is powered by a knowledge graph, it seemed like something the field could benefit from. That's how my interest moved into creating a knowledge graph and seeing how the community can benefit from it.

But at the same time, I came to knowledge graphs first and then discovered that there's this whole field behind them: ontologists who specialize in it, people who work specifically in semantics, like yourself. It's like discovering this whole other field after having first learned about the knowledge graph, which is the [00:04:00] practical application that I'm very interested in. So I can speak about how I went about creating the knowledge graph, but I would also like to know: what is a standard process by which someone would approach the problem?

Someone like yourself, how would you go about creating a knowledge graph? And since it seems like an ontology is highly significant for a knowledge graph, what is the best practice for creating an ontology? Because from what I have read, it seems like a subjective process that's more an art than a science.

I wonder if that still holds true for the sciences. Maybe I should stop here, and we can come back to the questions later.

Sure. Great. Thank you for that introduction, Vineeth. Lots of great [00:05:00] things to pick apart there. So you're interested in whether there's some kind of standard approach. I realize I come from a quantitative physical science background myself, and it's often nice to have quantitative methods versus qualitative methods, or standard ways of doing things that can then be adapted. So I definitely understand that interest.

I'll also say why I'm particularly interested in talking with you: I started off doing experimental work myself. I was doing nanofabrication, at MIT actually, and in that field I worked with materials scientists. Later I was at Lawrence Berkeley Lab in California with the Materials Project for a while, because I had gotten into web development and I really liked software, that sort of thing. So I got deeply into that world.

But even then, I was not doing [00:06:00] knowledge graph stuff or semantic stuff. I was mainly embedded in a world that I inherited and embraced, and still get a lot of mileage out of today, which is document orientation, where schema is a bit more fluid, less strict. The Materials Project still uses MongoDB as a JSON database.

It's not necessarily using JSON-LD or RDF or that kind of stuff. But I definitely experienced some hardship organizing that data. I knew there could be a better way to do it, but I didn't quite have a handle on it.

Then I discovered the FAIR principles and that paper, and that was a rabbit hole: the community of RDF and Semantic Web stuff. Since then I've really embraced it. So [00:07:00] coming from that journey, I'm glad I can speak with you on this.

I want to acknowledge, first of all, that yes, there does seem to be an absence of a one-stop shop for machine-readable data, particularly experimental data. I was involved with an effort among a lot of computational databases, the OPTIMADE effort in Europe, to get those databases together and standardize on some things.

And even then, I think we've had limited success. We've had agreement on some base API terms and protocol, but we haven't really gotten far into the semantics and getting things formally done there. There's interest in that, but even that's a bit lacking.

One thing I'll say in terms of a one-stop shop: I think you hit on something poignant with regard to Google. Google, [00:08:00] for a lot of purposes, happens to be a one-stop shop for a lot of things. But as you remarked, it's powered by its knowledge graph. And fundamentally, Google is an indexer, a search engine.

So Google doesn't necessarily have everything, although it tries to cache a lot. It really is continuously crawling the web, the web of documents most notoriously, where it will ingest a web page, see all of the links on that page, then visit those links and explore this web of documents.

So not a web of datasets, but a web of HTML documents that have links in them. And it's through this analysis that Google has been able to do such great things, even before it had its knowledge graph. A lot of its ranking was based on the PageRank algorithm, a graph analysis algorithm that asks: okay, what pages link to [00:09:00] these pages, and are the pages that link to them also heavily linked to, and so on.

But as you mentioned, most recently, about ten years ago now, in large part by buying a large semantic web graph database company, Google jump-started and really dived into its knowledge graph efforts, and announced that. You can see it all the time in their search.

Rather than just a text search, you'll often have information assembled on the fly: side panels with Google knowledge results, or cards. You might ask a question and get a Q&A card. And that's in part because a lot of people, in order to increase their rankings, have begun to use semantic technologies and linked data.

A lot of pages now use the so-called schema.org vocabulary. If you look at the source code of one of these pages, you'll see it [00:10:00] has JSON-LD with schema.org terms, and this helps Google produce those rich snippets. In the project I was on, the Materials Project, a lot of the DOI landing pages for the materials have embedded schema.org markup, and that's how they get indexed by Google Dataset Search, for example. So that's a one-stop shop, so to speak, for a lot of those things. But what I want to emphasize is that the one-stop shop is kind of a matter of service. Someone can decide to do that or not.
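As a concrete illustration of the kind of markup being described, here is a minimal, hypothetical sketch of schema.org JSON-LD as a dataset landing page might embed it. The dataset name and URIs are invented, and the snippet assumes Python with the rdflib package (version 6 or later, which parses JSON-LD natively):

```python
# Hypothetical schema.org JSON-LD, the kind a landing page might embed in a
# <script type="application/ld+json"> tag. An inline @vocab keeps it offline.
from rdflib import Graph

jsonld = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@id": "https://example.org/datasets/piezo-survey",
  "@type": "Dataset",
  "name": "Piezoelectric coefficients of perovskite oxides",
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
"""

g = Graph()
g.parse(data=jsonld, format="json-ld")  # the markup is ordinary RDF underneath,
print(g.serialize(format="turtle"))     # so it can be re-serialized as Turtle
```

An indexer like Google Dataset Search reads exactly this sort of block out of the page source, rather than guessing the page's meaning from the prose.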

But what's fundamentally made Google so powerful, and what makes semantic technology so powerful, is that the expression, the representations, and the serving of those representations can be completely distributed. And they can be collected and indexed.

So regarding what you were saying about a one-stop shop for machine-readable databases: with [00:11:00] semantic technology and web technology, rather than having all of these people submit their datasets to one portal and having that be the central place, a central portal entity could act like Google in a sense, reach out and index what other people are serving up, and maybe cache it.

So I think that's one thing to note. Getting now to your question about a standard approach to doing this kind of thing, there are a few different points.

Can I ask you: does that mean that the reason semantic technology exists today is to enable that indexing? In a sense, is the reason schemas are proliferating, and the reason we pay attention to that language, so that [00:12:00] our data can be accessed by these indexes and thereby reach a wider audience?

Yes, I would say that's a big motivating factor. The schema.org vocabulary in particular has been heavily adopted because people get better rankings in the search engines where others go to search for things. It can mean thousands, millions of dollars for companies to semantically mark up their content properly, so that Google can know what it is in a way it really can't quite do otherwise. You mentioned natural language processing: Google can and certainly does ingest full web pages, and it can do named entity extraction on the text in those pages and try to [00:13:00] understand what a page is about.

But if the authors of those web pages give explicit metadata, in the form of schema.org-typed documents, then Google can more unambiguously know what is meant.

Now, it's a separate issue whether they decide to believe what the author supplied, because lots of people try to game search results and all of that. But at least they know what's being said, and they can ask:

okay, this is what's being said; is it true or not? Versus: is something even being said here? So that's the big role of semantics there. Yep.

As a follow-up, has this been standardized? Where does it stand? Because I'm guessing there are multiple indexes, and Google is probably number one. Has the field [00:14:00] standardized, or are there competing standards, and how is that playing out now?

Sure. From my perspective, the field has standardized, and I can explain in which ways. There are people using graph databases that don't use the standards I'll mention, essentially the World Wide Web Consortium's RDF stack; Neo4j as a company has been successful with so-called labeled property graphs and that sort of thing.

But that's not quite a standard. They want to standardize it and merge it with RDF. Here's what I'll say about the Semantic Web stuff that has been standardized, and how that stack works.

There's the World Wide Web, and there are standards for that. In particular, some salient standards [00:15:00] relevant to linked data are the HTML content-encoding standard for documents, so a browser knows what it's getting when it gets HTML, and the protocol, HTTP, the Hypertext Transfer Protocol. HTTP is built on the more general internet protocols, the Transmission Control Protocol and Internet Protocol, TCP/IP, but the web is essentially HTTP plus the ability to have a Uniform Resource Identifier, a URI. So that's a standard. URIs have schemes, and the HTTP scheme is the big one, but you'll often see links with other schemes, like FTP, or if you click on an email address on a web page, it will use the "mailto:" scheme, which opens your email client.

So that's the Web. The part of it that's really standardized, again with the HTML [00:16:00] part, was documents. There's definitely a very standard way to encode documents of information.

But HTTP is more general than that. It can transfer representations of any type. So you can have a link not just to an HTML page but to a JPEG image, a PNG image, a DOCX Word document, an XML file, all that stuff. What the Semantic Web of data adds on top of that is, number one, a representation model not just for documents like HTML, a markup language closely related to XML, the Extensible Markup Language, but also for data in the form of assertions.

This is called RDF, the Resource Description Framework, and there's a W3C standard for it. It's a bit abstract; it's an abstract model, [00:17:00] but it builds directly on the Web. It says that in order to make statements about things, those things have to have URIs.

So it's already building on top of HTTP. And then, apart from that, there are some actual serializations that are standardized. Just as HTML is a standard document format, there are standard serializations for RDF. A common one is Turtle, which is a terse triple format; I find it convenient to read and write.
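To make that concrete, here is a small, hypothetical Turtle document, parsed with Python's rdflib; all the example.org URIs and the DOI are invented:

```python
from rdflib import Graph

turtle = """
@prefix ex:  <https://example.org/materials/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:BaTiO3 a ex:PiezoelectricMaterial ;
    ex:dielectricConstant "1200"^^xsd:double ;
    ex:reportedIn <https://doi.org/10.0000/hypothetical> .
"""

g = Graph()
g.parse(data=turtle, format="turtle")
for s, p, o in g:   # every statement is a (subject, predicate, object) triple
    print(s, p, o)
```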

There's also JSON-LD, which is JSON, but it encodes RDF. The key to JSON-LD is that there are a few privileged properties, one of which is the "@context" key, and what that does is tell you how to interpret a lot of the fields in the JSON.

So you might have a JSON key [00:18:00] called "dielectric_constant" or something, with some numerical value. What the @context does is let you treat that field name as the suffix of a URL prefix, so that you can go to https://…/dielectric_constant, or …#dielectric_constant, and that URI identifies the property and can say things like: well, this thing has a range of a number.

And you can start to get into all sorts of things. You can say not only that its range is a number, but that this numerical quantity is of an energy-quantity kind, and maybe by default it's expressed in eV. You can get all of these things on top of existing JSON infrastructure, for example, by doing that.
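Here is a sketch of that @context mechanism, with invented URIs; the term definition maps the bare key "dielectric_constant" to a full URI and coerces its value to a typed number (again assuming rdflib 6+):

```python
from rdflib import Graph

doc = """
{
  "@context": {
    "dielectric_constant": {
      "@id": "https://example.org/materials/dielectric_constant",
      "@type": "http://www.w3.org/2001/XMLSchema#double"
    }
  },
  "@id": "https://example.org/materials/BaTiO3",
  "dielectric_constant": "1200"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")
print(g.serialize(format="turtle"))
# One typed triple comes out:
# <...BaTiO3> <...dielectric_constant> "1200"^^xsd:double .
```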

So the main [00:19:00] standards stack to start looking into would be this RDF data model, serializations like JSON-LD, and the ways of representing the valid statements you can make. RDF is the abstract model; serializations like Turtle and JSON-LD are about syntax: how do I actually see these triples manifest?

In terms of the semantics, or the grammar, of saying "this is not a correct statement to make in this context," you get into ontologies. Specifically, the standard here is OWL, the Web Ontology Language. (It's more fun to say OWL than WOL, so it's called OWL.) This is a language for [00:20:00] specifying grammars. For example, there's an ontology called the Simple Knowledge Organization System, SKOS, which is also a standard.

SKOS is an ontology, described using OWL, that helps you construct a taxonomy. A taxonomy is a subset, a less powerful version, of an ontology that restricts you to saying certain things about hierarchical organization.

The only things you can say to relate entities in a SKOS taxonomy are things like: well, this concept is broader than that other concept, or this concept is narrower than that concept. Otherwise, if you want to say [00:21:00] two concepts are semantically related in some way, the closest you can get is to say that they're related, skos:related.
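A tiny, hypothetical SKOS fragment shows how small that grammar is; the concept scheme here is invented purely for illustration:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

EX = Namespace("https://example.org/concepts/")

g = Graph()
g.add((EX.Ferroelectrics, SKOS.broader, EX.Dielectrics))     # hierarchy, one direction
g.add((EX.Dielectrics, SKOS.narrower, EX.Ferroelectrics))    # the inverse direction
g.add((EX.Ferroelectrics, SKOS.related, EX.Piezoelectrics))  # related, but not hierarchical
print(g.serialize(format="turtle"))
```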

So that's the grammar you have available to you. With broader ontologies, you can have a richer set of predicates and define exactly what they mean: "has unit" or "was created by," that sort of thing. A couple of other ontologies, controlled vocabularies, that I think are quite relevant for the kinds of stuff we do: number one is PROV, the provenance ontology. There's a standard for this too, and it's how you coordinate things like agents and their activities and entities, an agent being an abstraction for a person or that sort of thing.

Getting back to some of the things you talked about in the lab: you might describe the provenance of some material being made [00:22:00] through activities like fabrication, characterization, and testing. The agents involved would be things like Vineeth, or this particular spinner or ALD system; an instrument would be an agent that executes a process and has some entity as input and output.
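A minimal sketch of that provenance pattern, with invented lab URIs; the used, wasGeneratedBy, and wasAssociatedWith terms are from the standard PROV namespace:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/lab/")

g = Graph()
g.add((EX.fabricationRun42, RDF.type, PROV.Activity))                 # the process
g.add((EX.fabricationRun42, PROV.used, EX.precursorPowder))           # input entity
g.add((EX.thinFilmSample, PROV.wasGeneratedBy, EX.fabricationRun42))  # output entity
g.add((EX.fabricationRun42, PROV.wasAssociatedWith, EX.vineeth))      # person as agent
g.add((EX.fabricationRun42, PROV.wasAssociatedWith, EX.aldSystem1))   # instrument as agent
print(g.serialize(format="turtle"))
```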

So PROV is one way of organizing the vocabulary for how you do that in a standardized way. And the final ontology standard I'll bring up is Dublin Core, which is a quite widely used concept scheme. You'll typically encounter it if you want to resolve metadata for DOIs.

It's very apt for bibliographic metadata: very general things like, who's the creator of this? Does it conform to some type? What's the license on it? When was it last updated? [00:23:00] Things like that. You'll often see DOI metadata returned in that Dublin Core format.
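For instance, a Dublin Core record for a DOI might boil down to a handful of triples like these; the DOI, title, and date below are all invented:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, XSD

doc = URIRef("https://doi.org/10.0000/hypothetical")   # invented DOI

g = Graph()
g.add((doc, DCTERMS.title, Literal("A dataset of piezoelectric coefficients")))
g.add((doc, DCTERMS.creator, Literal("Vineeth Venugopal")))
g.add((doc, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((doc, DCTERMS.modified, Literal("2022-01-01", datatype=XSD.date)))
print(g.serialize(format="turtle"))
```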

And then the final thing, which we mentioned, doesn't quite follow the OWL rules; it has a different purpose. There's this idea of an open-world assumption versus a closed world. schema.org is very popular and very well documented, but it's not quite the same as OWL; it doesn't have those same kinds of semantics.

A lot of that has to do with this idea of open world versus closed world, which can be difficult to get one's head around, but I think it's very important and useful. OWL makes the open-world assumption, which essentially means that rather [00:24:00] than validating whether you have a field you think you need for a certain dataset, it will only insist on telling you whether you've violated some logical constraint.

Whereas if you're looking for a valid, complete dataset, in order to, say, feed it into a machine learning model, and you want there to be a value for this column all the time, that's where you'll want more. You'll want to close the world: okay, in my world, in this database, these things have to be true.

There you get into things like a recent standard that's become widespread, called SHACL, the Shapes Constraint Language. That's more similar to what you'll see in JSON Schema or SQL elsewhere, where you say: okay, this field is required, and it has to be an integer, that sort of thing. Whereas OWL says things more like: [00:25:00] a material with a periodic crystal structure needs to have a unit cell. It could posit that or not.

If you're validating with SHACL, with a closed system, you'll get a validation error saying: hey, this material doesn't have a unit cell; you haven't included that in this record. OWL won't give an error on that. But if you give it a unit cell that doesn't have the right type, then it will complain, because OWL allows for the fact that you might not have all of your data yet.

So it's very general, and it alerts you to things like: this violates how you have told me the world works, how you've logically described your domain. This is never going to be valid; you can't ever make it valid; it violates how you've axiomatically described your field and domain of knowledge to me.

Whereas SHACL is more like: okay, now we want to close the world, because we want to feed nice, tidy data, with no missing columns, into a model.
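Here is a sketch of exactly that closed-world check, using the pyshacl package; the Material class and unitCell property are invented, but the shape vocabulary is standard SHACL:

```python
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/materials/> .

ex:MaterialShape a sh:NodeShape ;
    sh:targetClass ex:Material ;
    sh:property [
        sh:path ex:unitCell ;
        sh:minCount 1 ;              # every Material must record a unit cell
    ] .
""", format="turtle")

data = Graph().parse(data="""
@prefix ex: <https://example.org/materials/> .
ex:BaTiO3 a ex:Material .            # no ex:unitCell given
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)   # False: under the closed world, the missing field is a violation
print(report)     # human-readable validation report
```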

That was all very long-winded, but yes, there are definitely standards, and furthermore, there are lots of options for implementation. The way the W3C standardization process works is that for something to really be standardized, there have to be multiple independent candidate implementations. Those implementations don't necessarily have to persist. But, for example, there are a number of so-called triple stores, RDF graph databases.

There are open-source ones. [00:27:00] Wikidata currently uses Blazegraph, which is discontinued, and it's considering four other ones right now, because again, you have multiple choices for how you store these graphs and query them. The query language is called SPARQL, and it's another standard.

It's sort of the analogue of SQL in the relational database world. So again, you have various options for triple stores, and that's nice. There really is an ecosystem, though this wasn't always the case.
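To tie this back to the opening question ("list all the materials that are piezoelectric"), here is what that looks like as a SPARQL query, run over a toy in-memory rdflib graph; the data and URIs are invented:

```python
from rdflib import Graph

g = Graph().parse(data="""
@prefix ex: <https://example.org/materials/> .
ex:BaTiO3 a ex:PiezoelectricMaterial .
ex:PbTiO3 a ex:PiezoelectricMaterial .
ex:NaCl   a ex:Dielectric .
""", format="turtle")

q = """
PREFIX ex: <https://example.org/materials/>
SELECT ?m WHERE { ?m a ex:PiezoelectricMaterial . }
"""
for row in g.query(q):   # the same query would run against any compliant triple store
    print(row.m)         # prints the URIs of BaTiO3 and PbTiO3, not NaCl
```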

The fire was lit for the development of Semantic Web standards around 1999, 2000. There was a seminal article in 2001 in Scientific American by Tim Berners-Lee, the creator of the Web; Ora Lassila, who's now at Amazon Neptune and has done a lot of work; and Jim Hendler, who's at [00:28:00] Rensselaer and has also done a lot of great work. I think there was a huge catalysis of interest in 2001 with that article, and lots of standardization. The latest significant piece is this closed-world standard, SHACL, which was standardized in 2017.

SPARQL 1.1 was standardized in 2013, so it's been a couple of decades, with a rocky start. And right, that seminal article: I think it's called "The Semantic Web," and it's in Scientific American, so it's a popular-press piece. It's fantastic, a very broad overview of the vision.

I think it went through a hype cycle. People were very excited, and maybe there were inflated expectations, particularly in the late aughts, maybe beyond. [00:29:00] Not quite the same nosedive that symbolic AI took in the late eighties and early nineties with the so-called AI winter; AI then picked up a lot with the advent of practical neural networks and machine learning in the 2010s and onward.

So it's slowly been gaining steam, and it's definitely used quite a lot. Again, these are standards, and there are various ways of plugging into that ecosystem. Luckily, there does seem to be a lot of adherence to the standards among these vendors.

Unlike in the SQL world, where you'll sometimes nominally have people adhering to the ANSI standard, but really MySQL is kind of different from Postgres, which is kind of different from SQLite. I think you have a bit more discipline in the RDF world about "this needs to be SPARQL 1.1 compliant," that sort of thing. [00:30:00]

So yeah, that's roughly what the landscape looks like.

Okay. That was very helpful.

Thanks. I want to get back to one thing you were talking about earlier, a standard approach. There are various ways; one breakdown that I like, in terms of what goes into creating a knowledge graph, is a characterization of there being three different loops, so to speak, in building a knowledge graph.

One is labeled the so-called expert loop, one is a user loop, and one is an automation loop. They're characterized by the kinds of roles involved in doing this development. At a very small lab there might be one person wearing multiple hats, [00:31:00] but a lot of times there will be different people.

The expert loop is essentially about getting people who are knowledgeable about ontologies, often called knowledge engineers, who can help craft these ontologies in consultation with what are often called SMEs, subject matter experts, domain experts. In your case, you are the subject matter expert.

And there might be someone you reach out to, maybe in CSAIL, maybe in the CS department: someone who's not so knowledgeable about materials science but really knows ontologies. They might work with people in the life sciences; they might work with people on Wall Street or in banking.

Banking has a lot of linked data as well. And they'll work on that loop of [00:32:00] formalizing the knowledge of the domain into a lightweight ontology. So that's one thing. But you're not done when you have that; you also need the data to populate it.

The other part is the so-called automation loop, and that's where you get people like data engineers, people who know how to Extract-Transform-Load data. There's a lot of automation there, in transforming the raw data that you have in spreadsheets, in documents, in PDF articles.

How do you take an article abstract to RDF? It's going to involve NLP. But that's still all part of the automation loop, which isn't about the experts talking to each other and coming up with a conceptualization of the domain; that's its own loop, which can iterate as more data comes in.

So then there's the automation loop. And the final loop is the user loop. Here you'll [00:33:00] have end users who are actually interacting with the system. In your case, I imagine it would be people in the field, and maybe application engineers who have a front-end interface and capture user intent.

Things like: if there's a search bar, what do they click on and what do they type in? That can help expose discrepancies between what subject matter experts think the domain is and what's often called a folksonomy, as opposed to a taxonomy: what people actually using the system think, the synonyms they use, what their mental model is.

So a lot of times you'll have those three things going in parallel: the user loop, the expert loop, and the automation loop. Beyond that, I would generally describe the approach with a quote I've heard: think big and start small. You want to craft [00:34:00] your domain model, like an ontology, with extension in mind.

This is where the open-world assumption comes into play. But don't worry about completeness. Another phrase I've heard is: don't boil the ocean, meaning you don't have to map out your entire domain. So in the case of materials, you might be interested in characterizing piezoelectric properties.

You don't have to have a nice domain ontology for tensile strength or other things, to the extent that they matter or don't matter to you. You don't have to characterize all of materials science and declare "this is materials science." You can just model part of your domain and then populate that with relevant data.

The idea from there is that because it's an open system, [00:35:00] where you use URIs to connect things, you can always add to the graph. Whereas if you have a closed system, like a traditional relational database table, you might have a table of materials with four columns, right?

And if you want to record something new, you need to add a fifth column and migrate the whole database, and maybe it has a null value for existing rows. You can't just add things; you have to reformat. You have to break down a wall of the closed world and build a new wall.

Whereas with this technology, you can start small and just keep adding things. The idea is to have enduring connections without the necessity of endless transformation, so [00:36:00] you can go and do your processing when you want to. That was a bit of a roundabout way, but I hope I've conveyed the different roles involved and how they might do some of that work.

So from what you said, what I understand is that it's easier and more tractable if you approach it in sections. Say I want to make a knowledge graph of materials: I would probably focus on one subdomain, create the domain model for that, then add the next one, and so on.

And maybe it might even be a community effort, with different people coming in. Did I get that right?

Yeah, that's correct. Are you familiar with object-oriented programming? A lot of the modeling is [00:37:00] reminiscent of that, in terms of inheritance and plugging in.

Again, you can start small with something, and if there's something you're not quite going to get to, but you have a class that someone can subclass, that ensures that you, later, or someone else, can plug into that model. But that's exactly right.
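In RDF terms, that "plugging in" can be a single subclass assertion. In this hypothetical sketch, both namespaces are invented: one stands for your small starter ontology, the other for someone else's later extension:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

CORE = Namespace("https://example.org/core/")       # your published starter ontology
LAB = Namespace("https://example.edu/their-lab/")   # someone else's extension

g = Graph()
# One triple, asserted by the other lab without needing your permission:
g.add((LAB.RelaxorFerroelectric, RDFS.subClassOf, CORE.PiezoelectricMaterial))
print(g.serialize(format="turtle"))
```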

You start with something small, and you do so with discipline, so that extension can be unprompted. Cory Doctorow, who's done a lot of work on electronic freedom, has a phrase that stuck with me: adversarial interoperability.

It sounds a little scary, but the idea is that for interoperability to really work, [00:38:00] it's a good idea to build systems that someone else can plug into without your permission, essentially, because the protocol is open. Versus: you've developed a little graph and published it,

and someone has to contact the author to actually understand it, and they have to submit a pull request to your repository, and nothing gets in unless you let it. One of the big benefits of semantic technology, of the Web, of URIs, of the Semantic Web, is that you can provide these open protocols where people can plug in, much like the document web, right?

I don't need to ask someone if I can link to their web page. And depending on Google indexing, I might become a more authoritative source than someone else and be a top link; Google just needs to index all of the links. So [00:39:00] it's a way of creating the basis for an ecosystem of knowledge that grows in a machine-actionable way, in a way that the current narrative-centric arc of science has kind of outgrown. It's impossible to read all these papers coming out, and to curate them.

So, Donny, my concern with that approach is that a field like materials science is so vast and so diverse that it's quite possible we will never get to a place where the whole field has been modeled in that way. Someone might discover a new property, and that might become a whole area of research, and we may need to wait for an expert to do the data modeling and add it to the model we have created.

In such a system, it seems like the part-by-part approach [00:40:00] might slow the whole field down, and we may never get to a place where we could answer the sort of questions I raised at the beginning: if I give you a property, list all the materials that have that property; or if I give you a material, can you list all the properties that this material has?

So, is that a valid problem? And if so, what is the data modeling approach for it?

Sure. Yeah, it's good to clarify that. What these standards allow is decentralized progress that is able to hook up in qualified ways. Imagine the web of documents: [00:41:00] there are various places that have documents about various things. Wikipedia has a lot; MIT has a bunch of open courseware. There are lots of authorities who literally have reserved different authority tokens in the URI scheme of HTTP, like mit.edu or wikipedia.org. They have that authority, through the Domain Name System, to assert whatever they want about things.

And then they have the option to link to other things. Someone at MIT might have a web page that links to something on the berkeley.edu domain, because they think something someone's doing at Berkeley is worth investigating. But currently they can only do that [00:42:00] with a plain link.

You're not quite sure what the relation is between the current page and that other page; it's not a typed link. What this technology, what Linked Data, what the triples allow you to do is have typed links. So you might have a graph of your part of the field, and someone else independently develops a graph that characterizes their field.

And there are no clashes, because we have things, URIs, not strings. We don't have to worry about whether what you call conductivity is the same as or different from what someone else calls conductivity, because that term is actually a URI. In your case, you might use a DOI or something, but let's say it's hosted directly at MIT: it would be mit.edu/vineeth/conductivity. The [00:43:00] idea is that there's an authority component to it.

Whereas they have this other idea of conductivity, which they own and use in their graph. But like the web of documents, you can then link to those things. You might voluntarily assert on your website that your definition of conductivity is the same as this other person's definition of conductivity.
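That voluntary assertion can be a single triple. In this hypothetical sketch, owl:equivalentProperty declares that two independently minted "conductivity" terms mean the same property; the second URI is invented:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

mine = URIRef("https://mit.edu/vineeth/conductivity")
theirs = URIRef("https://example.edu/other-group/conductivity")

g = Graph()
g.add((mine, OWL.equivalentProperty, theirs))  # one statement joins the two graphs logically
print(g.serialize(format="turtle"))
```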

And what's crucial is that by making that connection, you've already linked the graphs, logically. Now, certainly there is compute and data work if you want to actually follow those links and ingest the graph, just as Google has to physically have servers that crawl the web. The web just exists;

the links are there, but they're not necessarily indexed or ingested. But things can connect to each other, and there might be latent structure there or not. [00:44:00] So that's the idea here: yes, the field might never be finished with this graph, but the alternative is that it's currently not finished either.

I mean, research is still happening; materials science is not done. But the way these little pockets of knowledge advancement are currently being encoded, these knowledge diffs, is as papers: as just text with a bunch of untyped links to references. It's a little better now because a lot of them are DOIs, so you actually might get a graph. But still, it's just a bunch of text that is, theoretically, a diff on knowledge.

There should be some improvement on the state of the art represented by the references. And that's all happening in parallel, but it's not machine-actionable. [00:45:00] I realize I used "diff," a technical term. What I mean by diff: I'm playing a bit into some inside baseball of version control, which has sort of taken the world by storm.

With the Git software version control system, you have a command called "git diff", and a lot of times people will just refer to "diffs." On GitHub, if you're trying to evaluate a change that someone has proposed as part of a Pull Request, you'll be presented with what's called a diff, a difference between the files.

You might see a representation of: okay, this line was added, these lines were changed. Rather than having to read one full document and then another, you can have the changes displayed side by side. [00:46:00] You'll often get a similar thing with editing platforms; Microsoft Word will track changes, for example.

I think that's a way of showing essentially a differential representation, where maybe the lines that were added are in green and the lines that were removed are in red. Assuming you already know the original, it's nice to have a difference representation of the change you're being asked to approve or not.

That's very standardized with software. It's quite simple because the Git system works line by line; it doesn't try to have any semantics. It doesn't say "this function changed" or "this class changed"; it just says: this line was added. It's completely agnostic to the programming language, and it works quite well in a lot of instances. There's actually been some nice recent work on Jupyter notebook tooling to do diffs that are not line by line, because Jupyter notebooks are encoded as JSON, so line diffs are a little hard; it renders the notebook [00:47:00] itself as a diff. Anyway, in the ideal case, as a reviewer I would love to conceptualize a novel research article as a knowledge diff for the field. Essentially, you're going to be repeating a lot of what other people have said, and I may agree with that or not.

You've provided citations for that, and you're saying something new, and this needs to be a contribution to the field. So that's what I mean by knowledge diff, at the conceptual level, for a paper.

Now, we have a more fine-grained, technical, interoperable, one might say, view of it with code, just because things are nominally directories of plain-text files, diffed line by line.

So you can review these diffs and decide to [00:48:00] incorporate a proposed change into a piece of software. We certainly don't have that form of a process for science. The pull-request review process we have is peer review, and even then, the various knowledge repositories are distributed.

When you submit to Nature, you're submitting a Pull Request to the Nature repository. And if you feel like Nature is a reputable journal, then that's great: whatever's in the main branch of Nature, you accept, and they have retractions and so on.

But the idea of the Semantic Web is that you can do this kind of diffing for data. So you might have the same sort of structure of decentralized knowledge accretion that you have in the publishing world. When there isn't a venue for a field, often you'll start a journal: [00:49:00] this field needs a journal.

It's still very decentralized, but everything's compiled into documents. So we might never have the whole of materials science characterized, but at least we might be able to analyze what we have with machines, versus just having to mine a corpus of documents. Historically, NLP over documents has been the only option.

Whereas again, people could independently build these graphs that link together through web technologies, the same way web documents link together, regardless of whether they're effectively indexed. And I think Google has done a fairly good job of indexing a lot of the document web.

That is another component: all of these scientists making all of these graphs will be of limited benefit unless there's some collection or indexing. So I can understand the desire for centralization, and for a sustainable model: [00:50:00] maybe some institute holding all of the materials science data so that it can be indexed there and served.

But in terms of actually creating the data and having it there to possibly be indexed one day, that can be done in a decentralized way, just as anyone can write an HTML file and put it on a server, where it happens to have a link to another server that may or may not resolve.

So I think that's the idea: you'd still build little pockets at a time, and each research group could have its own little graph. But you could actually link across the entire field in a machine-actionable way that's more formal and more meaningful than just citing a DOI in a paper.

From what you said, is it fair to say that the data modeling part is mostly done manually today? It seems almost exclusively manual. [00:51:00]

Some of it's done manually, but some of it can be done automatically, depending on how things are set up. If two ontologies are using OWL and there are qualified relations between them, then it's possible to ingest another model by using a mapping. But I would say there is still some manual modeling.

The value of it, though, is that you can always link models together and relate them, because of the triple. If you have any concept in one domain model represented by a URI, and any other concept that anyone else has ever developed in a domain model, you can choose, or [00:52:00] invent, a predicate to put between them and link them.

So while this modeling process can be manual for a lot of fields that don't have it yet, I still feel it's accretive. It isn't a matter of throwing things away and rebuilding all the time. It's not that we manually built a two-story house, we're dissatisfied with it, and we have to knock it down and build another two-story house; it seems more like we can just keep adding floors. I don't know if the analogy is apt. So while there is some manual effort, there's more possibility for it to be extensible and built upon, and you can lean on others' work rather than build everything from scratch.
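A small sketch of that accretion with rdflib: two graphs built independently merge by simple set union, with nothing knocked down or migrated (all URIs invented):

```python
from rdflib import Graph

ours = Graph().parse(data="""
@prefix ex: <https://example.org/ours/> .
ex:BaTiO3 ex:bandGap "3.2" .
""", format="turtle")

theirs = Graph().parse(data="""
@prefix ex2: <https://example.edu/theirs/> .
ex2:BaTiO3_entry ex2:density "6.02" .
""", format="turtle")

merged = ours + theirs   # rdflib overloads + as set-theoretic union of triples
print(len(merged))       # 2: both graphs' statements survive, nothing is rebuilt
```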

I would honestly view that as one of the selling points as well. Even if someone's done great work on a model and you're unsure about one part of it, just create new [00:53:00] URIs for all your stuff, have your own sandbox, and add triples that relate the parts that are the same.

So you're saying: most of this is the same, but my little part is different. Maybe eventually it'll be reconciled, but you don't have to wait to find the perfect prior art; you can reuse what other people have done. That ability to build on portions of other people's work also lowers the barrier for some of the manual effort.

Sure. Do you want to keep going, or should we probably wrap up? This is exciting. I feel like I've talked a lot; I think the majority of the time has been me talking, and I hope you've gotten something out of it.

This has been great for me; I really enjoyed it. Especially the part at the beginning where you explained why we index, the role of the [00:54:00] indexes in creating the semantic tech: that lit a light bulb for me, and I feel like I suddenly understood what was going on. I think that was the missing link I was searching for.

Sure. So let's wrap up now. One thing I want to know is: who should I invite next on this podcast? I'd be very interested in a recommendation, someone you think might enjoy this kind of conversation, or ones similar to those you've heard before on the podcast.

I would say maybe Alán Aspuru-Guzik at the University of Toronto, because he developed ChemOS, a sort of ontology that is being used, or proposed, for high-throughput instrumentation. So, [00:55:00] an ontology for robots, essentially, that run repeated experiments in a loop.

Yeah, he developed ChemOS, the software slash ontology that might be the backbone for that, so he might be interesting. Great. Yeah, he's a chief editor of Digital Discovery, and I follow him a lot on Twitter. Right. I should totally invite him on; I'd love to talk with him. Thank you. And another thing I want to ask you: I ask people, if you could leave our listeners with any advice, broadly scoped, but something, just because you've gotten to where you are. Any advice to people out there who are on a similar journey?

You mean academically? I'm not prepared for this question. Okay. No, but [00:56:00] literally any advice, like: don't invade Russia in winter. It could be any life advice; it doesn't have to be particularly related to FAIR or that sort of stuff, just because you're a multidimensional person. We all are. So it could be anything. Any advice you'd like to leave?

What has helped me as a researcher is talking to people from all sorts of backgrounds, like we are doing now. It broadens my horizons and helps me think about things I normally wouldn't think about. I feel like that's where creativity comes from: the intersection of completely disparate things. So maybe that's something that, if people are not doing it already, they could try out.

Great, thank you. And if you don't mind, I'd like to bring that back to semantic technologies, because [00:57:00] these are disparate things, but if you have a handle on them, then you can relate them in qualified ways, and say: I think this is the intersection. No, I think this is the intersection. Relating that to FAIR: that's what principle I3 on interoperability is all about, relating things through qualified references. And I'll touch back on the importance of indexing: that's FAIR principle F4, your stuff should be registered and indexed in a searchable resource. So these are very important things.

So thank you for that; I agree with it as well. I love talking with people. What is that chemistry curve? I forget. It's a potential that's high when you're very [00:58:00] close, and also high when you're really far, but there's a sweet spot, a dip, in between. I feel like it has to do with chemical bonding, maybe.

But my thing with intersections is: things that look very close to what I already know probably aren't as interesting, and for things where I don't even have something to grab onto, my mental space isn't there. But there's some in-between where I have some tentative connections, not a lot, and interacting with someone enriches that neural network so much, because I already have some connection.

And I would venture that's also a flywheel-effect point for Semantic Web stuff. The quote is: a little semantics goes a long way. Even if you can link just a little bit to something else, that might be a branching point for someone else to meet [00:59:00] someone in that other field, because you've created those few links, and they can help enrich the graph for you. Sure. Okay, enough pontificating.

Alright folks, that's it for today. I'm Donny Winston, and I hope you join me again next time for Machine Centric Science. Thank you for joining us, Vineeth.

Thank you, Donny.