(This transcript was auto-generated and then outsourced to edit for clarity, but I have not fully reviewed it. If you notice errors / confusing bits, please let me know: firstname.lastname@example.org)
Donny: Hello, and welcome to Machine-Centric Science. My name is Donny Winston, and I am here to talk about the FAIR principles in practice for scientists who want to compound their impacts, not their errors. Today we're joined by special guest Christophe Blanchi, currently executive director at the DONA Foundation, but also someone with a rich history and many hats to talk about, current and past. Thank you so much for joining us, Christophe. For listeners who don't know about you, can you introduce yourself a little bit and why I'm so excited to have you on today?
Christophe: Well, thank you very much, Donny, for inviting me to this podcast. So, a little bit about me. I'm Christophe Blanchi. I'm a computer scientist by profession originally, and I worked for a long time at the Corporation for National Research Initiatives, led by Bob Kahn, back in Reston, Virginia, and worked on these issues of digital object infrastructure and architectures with, you know, a lot of emphasis on digital libraries in the early nineties. And as time went on, we developed more projects applied to DoD or financial transactions or other things. And those interests led me to be nominated to run the DONA Foundation, which was created in Geneva in 2014 to operate the infrastructure that CNRI had created; mainly the handle system, which many people would be familiar with and have seen, you know, as DOIs. The creation of the DONA Foundation in Geneva was done mostly for neutrality purposes, so that the handle system could be operated on neutral ground without any particular – I would say – liability from political concerns; mostly, at the time, stemming from the State Department. So this is where, and why, the DONA Foundation was created in Geneva. And from here, we have continued developing the global handle registry and the handle system ecosystem, with a focus on attracting more international bodies, different organizations and different parties to try to, you know, further the interoperability of identifiers globally and to create a community that leverages this architecture. And, specifically in this case, the FAIR principles really match well the sort of activities that the DONA Foundation is involved in. So, in a nutshell, that's what I do. I used to write a lot of code; not as much anymore, but I still do as a hobby or just to get certain things done. But I'm really a technologist from the ground up.
And although I do get more and more interested in the human aspects and the organizational aspects of infrastructure, I still ground all of my concepts in, you know, the practice, which is: "Can this run?" "Can we imagine building this at scale?" But I must say that the human aspects are maybe something that technologists tend to forget about. But I think they play a key role, of course, in making all of these architectures run and be developed and adopted.
Donny: Right. Yeah. Thank you. I definitely agree with you, and I love to geek out on technical stuff, but yeah. It's really been apparent for a while to a lot of people – you know, people who love coding will maybe get a little older and more experienced and they'll realize, "well, it's really sociotechnical, and it's harder to manage people than to manage code, even though code is really hard to manage too." So yeah. What I'd like to really dive into here is this idea of identifiers. It's the first thing about FAIR. It's the first thing about the first thing about FAIR: "F1: (Meta)data are assigned a globally unique and persistent identifier." And everything kind of flows from there. And there are a couple of parts to that. There's the global uniqueness and there's the persistence – and then there's also the identification. But it feels like, you know, persistence is sort of where we'll maybe get a little bit more into the social stuff, but also global uniqueness – you know, people deciding not to clash. And I guess what I want to ask about a little bit is your perspective on the ecosystem of globally unique, persistent identification, and how, historically, the handle system grew up in that kind of environment. Maybe people can have a sense of: how do handles relate to HTTP URLs? How does the handle system relate to the domain name system, where it seems like you can also reserve a prefix of some sort, a domain name, and how does that all play out? And maybe a little bit of, like, why the handle system has numbers. Just this whole idea of... people can get a domain name and park on it for a while. But, I mean, just sort of how that grew up. Because I know maybe there was some question in the early days about the web and DNS versus the digital object architecture and Kahn's vision, and how that's evolved today. So there's this whole slurry that's evolved, and I'm wondering if you could speak to that.
Christophe: I'd be glad to. I must say it was my bread and butter for years, right? So it's a topic that's close to my heart. But I think this is a very important topic because, even today, we hear, you know, conflicting information about what the handle is. Just a little bit of background. Bob Kahn, when he created the Corporation for National Research Initiatives, basically wanted to have what we call a little experimental box, where he could think about concepts and then try to move certain of these infrastructures forward. The story that we used to tell ourselves at CNRI is that, after solving the issue of packet transfer on the internet, the question became: the internet doesn't really know what it's transferring – what would it take to make the internet aware of the nature of the information it was shipping? And this is, you know, the core concept behind the framework for digital object services, that seminal paper, which...
Just a story aside: when I joined CNRI, my first job was "read this paper and implement it." At the time, I thought it was a little bit abstract for a straight implementation, but that was life at CNRI, where you just had to, you know, make do with what you could think and then make a good argument to Bob for why you were doing what you were doing. So, early on, this notion of coming up with a model for describing information on the internet was really the core concept behind the digital object. One of the things that Bob Kahn always mentions in passing, when he talks about the digital object architecture, is this notion of "knowbots": these mobile pieces of code that would go from one place to another. It's "know" and "bots," and it's a concept that still carries some validity. But I think it was, at the time, a reflection of the fact that computation was very limited, and the idea was being able to parallelize a task and send these knowbots out to places. And that was really another core part of this architecture, where these knowbots would go to digital object repositories, perform operations on those repositories, aggregate information according to various algorithms that were intrinsic to the knowbot, and report back to the launcher of the knowbot. And Guido van Rossum – who was at CNRI at the time – and his team were developing Python, but they were also implementing knowbots to demonstrate the value of the architecture. So that was a very interesting project. The reason why Guido was selected – that was just before I arrived; Guido had just joined CNRI – and a lot of the concepts of the pickling of digital objects using Python pickling were there. But that's another interesting historical note.
Donny: I did not know that Guido was at CNRI.
Christophe: Yeah, he was. One of the earlier versions of Python was developed there. And there are some versions of the handle system that were implemented in Python, partly because some of the earlier implementations of the digital object repository and the digital object architecture were done in Python. I think, at that time, Java was taking off and, unfortunately, Python lost out – but it's winning out now. So it's just a question of being patient. But to go back to this notion of "what is a digital object and why do you need it?" Basically, you need to have a model for information and services that is fairly broad, yet sufficiently specific that you would recognize a digital object when you see one, and you'd be able to interact with that digital object in a way that is consistent enough that any computational device could start a conversation with a digital object. And this minimum level of specification and interoperability is the key aspect here.
One of the things that Bob likes to say is that "standardization is a subtractive process." You want your specification to be as minimalistic as possible: you put the minimum necessary in your standard, and any convenient function that you would like to see – because it makes your life easier – you should remove, as long as you can implement it using the other basic concepts. And so, the digital object architecture is one of these architectures that are subtractive. You look at it and you don't see that many hugely different concepts. It's just that they are organized in a particular way that lets you start building large-scale distributed infrastructures where clients can have deterministic interactions with everything, regardless of what those things are, and leverage the core notion, for instance, of typing within the digital object architecture. That leads you to the next step, which is: "all right, I found this digital object. I have its ID, I can find out where the object itself is, and I can find out the types of its values. And if I don't understand the types, well, these types themselves are going to be defined as digital objects, and if I go down the rabbit hole, I can find out what these types are, maybe download some code or acquire some, you know, regular expression that allows me to do something with it, and proceed to the next step." And one important thing about the digital object is the notion of "kernel metadata," which, in and of itself, was never sufficiently defined: you know, date created, date last modified, type, identifier, and maybe, you know, what you would call metadata. But, very quickly, you run out of things that – should I say – apply to everything. Like, size is already starting to become a little bit of a problem, because size of what?
The digital object itself may have a size on disk, and the size of a particular element within the digital object would make sense, but a digital object would typically not have a size the way you would think of a file size. It's not really the nature of the beast. So, going back to this architecture – one last thing. The notion of an identifier was a critical part and, surprisingly enough, was supposed to be the easy part. And this is where the handle system was born. We needed a way to identify these digital objects and, you know, how hard can it be?
So a lot of the funding at the time was through digital library funds and ancestral funding from DARPA and NSF, and that's what CNRI used to develop some of these technologies early on. And I would say that by the mid-1990s it was already clear that the institutions interested in the handle system and what it offered – which is a level of dereference – were the publishers. Because the publishers were new to the web, and they realized very quickly that URLs were breaking a lot, for all sorts of reasons. It's not because DNS doesn't work, it's not because URIs don't work; it's mostly the sociotechnical fabric of what you put in a URL. And so the publishers realized that they needed an identification system that was independent of the underlying technology and the naming conventions that were part of the typical URL. And so they were the first ones to say, "all right, this is the right approach for identifying journal articles, or things that we need to identify over the long term." And this is where this notion of persistence already started, you know, rearing its head. And in this case, the persistence is not that the implementation has persistence – although an implementation that doesn't change over time helps a great deal with persistence; it's not the only requirement, but it's an important one. It was this idea that you could hide the management of the links behind an identifier – and nobody would be the worse off for it. So it provided you with this ability to manage this dereference. And that was something that, you know, not so many identifier systems at the time provided. Now, you could make DNS do stuff like that, for sure; but that's not what DNS was originally used for – you know, you identify mail servers, you identify IP addresses, but you're always at the level of a service. You're not really trying to identify individual things. That was not... and even the way that DNS is managed...
so I guess we're having this DNS conversation now.
The way DNS is managed, it's really meant to be managed at a greater level of granularity than, let's say, the typical handle system is. And that reflects the sort of things that you do with DNS, which is: give companies names and let them manage their sub-names within their own names, and all of that works well. But if you're talking about top-level DNS – you know, dot com, dot edu... or the next derivation below – you're talking about zones; you're talking about an update that is not at the granularity of a single admin within the company that gets the DNS name. So the handle system, from early on, had this task of being able to identify anything and everything, and to allow the handle system itself to be used to manage its own values, with an administrator that was itself identified with a handle. So, from early on, that was the design of the handle system, which wanted to use PKI challenge-response to identify entities within the handle system, to allow for any level of granularity you're looking for. And the original RFCs of the handle system, you know, speak fairly clearly about how this works. And the latest update to those RFCs, which is DO-IRP version 3.0 – that's an open standard accessible through DONA – still describes the exact same mechanism, and the main update to the RFCs is the fact that, nowadays, we have a distributed global handle registry, whereas before, CNRI was the main administrator of the global handle registry.
And that is an interesting comparison because, in the DNS space, you have your 13 roots, but there's only one that's a primary, and you don't really have a true multi-stakeholder administration scheme; you have a single primary DNS server that's, I think, still maintained by ISI – but it's under California law, and that's where the primary is. Now, you have other mirrors around the world, but they're mirrors. You cannot insert new values into those mirrors. So, while ICANN is clearly a fairly neutral organization that manages this, if you go down to the legal entities that control ICANN, there's still some degree of US control. Now, clearly, ICANN is not using that to do anything nefarious at all. But it's interesting how some vestigial remains are there, and somebody who wants to speak ill of ICANN could just, you know, point to these things and say, "Well, are you totally multi-stakeholder?" "Are there some potential legal limitations?" "Could the State Department influence what ICANN can do on this primary server?" Honestly, there are no simple solutions to these things. It is what it is. And I think, you know, ICANN does a great job at managing DNS, and it's something that will remain. I don't think that the handle system and DNS are competing with each other. We serve fairly, you know, orthogonal purposes. Yes, we're identifier systems. Yes, we have resolution protocols. But we're not trying to deal with the same sort of things. You could use the handle system to point to services. You could use DNS to point to objects. I think it's a question of, you know, what is the technology that suits your problem space the best?
A lot of IoT systems – and also RFIDs – were implemented using DNS, and other systems were implemented using the handle system. So, you know, my view is: take the technology that works best for you and go with that. There are some interesting stories about DNS and the handle system. I would say in the late nineties, the Chinese wanted to be able to have Chinese characters in their DNS names. And at the time, DNS could not support that. So they actually implemented a handle system-based DNS. It was a handle system where the administration would go through the handle system administration, but the database was also exposed through BIND, so that you could resolve the results of these secure, identifier-based administrations. So there were some experiments like this; I don't think it was deployed at any particular scale. And now, you know, DNS supports multiple character sets, so this is no longer an issue. But the administration of DNS is really – I think – the biggest notable difference here. It's really the granularity of administration, and also the extensibility of what you typically put in a DNS record. The values that you put in a DNS record are more – should I say – standardized. They could be extended, but most resolution systems would not understand what they are, whereas the handle system is really meant for clients to put whatever they want in the handles. You know, any type-value pairs that they see fit can be put there. Now, clearly, we end up with the issue of interoperability. So what do you name these type-value pairs? Our preference is to try to use handles to identify them, or PIDs of your choosing. And so, this is where, you know, FAIR and the digital objects share many of the same concerns.
Donny: I see. Well, yeah, if I can clarify that. So, for example, I have a handle prefix that I have from CNRI – I'll put it in the show notes – it's 20.500.14132. And I created a handle "20.500.14132/chris," and it currently has one key-value pair record, where the key is URL, which was the default in the interface. And it points to your landing page on dona.net.
Christophe: That's right.
Donny: But yeah, I do also see this. There are other things, like "HS_ADMIN," and there's "other," and so, that's interesting. So, there's this extensibility where, like, maybe the keys... and, yeah, just in general... I'd like to walk through... Maybe some of our listeners have more familiarity with how the DNS system works for setting up and resolving identifiers. And I'd love to go into that and some of the parallels with what things are called in the handle system... I mean, you got into something very interesting, which is that even at the very top there's something to be said relating ICANN and DONA, at that level. For example, if I want to make a persistent identifier using HTTP and DNS, how I would go about it is: I would go to some site that claims to be a registrar. I use this company called DNSimple. And they are able to give me a full – I guess in the case of DNS it's almost like a reverse prefix – where, rather than CNRI giving me 20-dot-something, or some DOI Foundation delegate, like Crossref or DataCite, giving me 10-dot-something, you know, this registrar can give me something.com, something.org... whatever it can do at the top-level domain. And so I'll get something, and I'll pay them a certain amount per year for that domain. And then I'll go into their console and I'll manage DNS records. And I can also do something with the registrar where I can tell someone else to manage my DNS – like Cloudflare – and I'll put DNS records in there. Essentially, I'll maybe spin up a server somewhere with an IP address, and then I'll create an A record pointing to that. And then I'll stand up an HTTP server there that will then serve a page. And that's how I get my URL. And it's up to me to kind of make that persistent. So that's how that sort of works. There's this ecosystem of registrars, and then maybe a separate DNS provider for records, and then there's some host that will give you that static IP.
And then, yeah,
Christophe: There are a lot of moving parts.
Donny: A certificate, which now you can get from Let's Encrypt. And then you need a web server like Apache or something on that Linux system. And you need to serve, like, an HTML response. Or – I guess – in HTTP, you can decide to serve, you know... use HTTP content negotiation. There's sort of all this stuff of, like: how do I get a machine – like the equivalent of a knowbot – some machine program, to ask "what is this?" And that's, like, a GET request. Just from the perspective of someone who's technical, who's not afraid to dive into things, who wants to get an idea... how does that work in the handle system?
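[Editor's note: the chain Donny describes can be sketched as decomposing a URL into its separately-managed moving parts. This is an illustrative sketch only; the domain and path are made up, and the comments name the stakeholders discussed above.]

```python
from urllib.parse import urlparse

# A hypothetical "persistent" URL; example.org and the path are made up.
url = "https://example.org/id/chris"
parts = urlparse(url)

# Each component is a separately managed moving part that must stay
# stable over time for the link to keep working:
scheme = parts.scheme        # "https": depends on a TLS certificate (e.g. Let's Encrypt)
host = parts.hostname        # depends on a registrar and DNS A/AAAA records
port = parts.port or 443     # depends on the server's configuration
path = parts.path            # depends on web-server routing (Apache config, etc.)

print(scheme, host, port, path)
```

Every line of that decomposition is a place where "link rot" can creep in, which is the point Christophe picks up next.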
Christophe: So, the handle system. Well, you did a great job of making my point, which is: it's not that you can't do it with URLs; it's just that it's actually a lot of work, and there's a lot of inconsistency. Because, you know... my question would be: do you really have an identifier, or do you have a webpage? And if I'm trying to resolve your identifier, I'm not sure what I'm going to get. I could get lots of stuff: I could get a webpage that is human-readable; I could get a piece of JSON; if I'm lucky, you maybe have a way to give me a signpost, which is easy to parse. So there are a lot of possibilities, and it's not that it doesn't work. It's just that, from a client perspective, you can imagine that your chances of getting an interoperable result from resolving a URL, when you're expecting an identifier resolution, are fairly low. It's not that you couldn't standardize this stuff; it's just that, at the moment, it isn't standardized. And there's not a huge demand for it either. Yet. One other thing I'd like to say is that there's a strange mixing of layers: you have HTTP as an access protocol, but you have nothing that defines your data model on the other side; HTML is formatting, it's not structure. So there are a lot of things that are not defined well enough for a client to say, "oh, I know how to deal with this." So, in the handle system, the fact that you have a protocol to resolve an identifier into a handle record – in this case – gives the client a lot of certitude as to what to expect. You're going to give this identifier to a protocol, and the response will be a structured result of type-value pairs. It's a little bit more complicated than that – there are little bits of substructure, because each record has a little bit of key metadata of its own, like who can manage it, when it was created, etc.
So there's a little bit more information. But, to keep it simple, you get a bunch of type-value pairs, and then you just pick the type-value pairs that you know. And if you're maybe a smart machine, you might be able to reason about what you have. But, compared to DNS, the stuff that you have to put in place is much simpler. You just have to have a local handle service somewhere – that maybe you maintain, or somebody else maintains – and your prefix has to be homed to that local handle service, so that the global handle registry, when it gets involved in the resolution of the handle by your client, will be able to say, "Client, that local handle service over there is responsible for this prefix and all the handles created under it."
And so the process is fairly straightforward, especially from a client perspective. Your client, before, had to parse the URI, figure out that it's HTTPS, extract the DNS information, resolve the DNS, get the server information, then get the port number, then connect using HTTP, send the whole request – and then you get something back which is potentially unknown to your client. And you're going to have to do something with it and hope that your machine is able to make heads or tails of it. Whereas in the handle system, it's a simple resolution, where the only point is to get, from that identifier, an expected structure back from the local handle service, after querying the global handle registry: "who's responsible for this handle?" And once you get that address, you just connect to the local handle service and do the resolution there. So the flow is simpler, and I would say this is partly because the handle system is a one-trick pony: you're just resolving handles into handle records. HTTP is much more flexible and can be used for many different things.
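[Editor's note: for a concrete picture of "an expected structure of type-value pairs," here is a sketch that parses a handle record shaped like the JSON returned by the public handle proxy's REST API (GET https://hdl.handle.net/api/handles/&lt;handle&gt;). The record below is constructed locally for illustration – the URL value is made up, and Donny's prefix is used only as an example – so this is not a live resolution.]

```python
import json

# A sample response shaped like the handle proxy REST API's output.
# The handle and URL value here are illustrative, not a live lookup.
record_json = """
{
  "responseCode": 1,
  "handle": "20.500.14132/chris",
  "values": [
    {"index": 1, "type": "URL",
     "data": {"format": "string", "value": "https://www.dona.net/"}},
    {"index": 100, "type": "HS_ADMIN",
     "data": {"format": "admin",
              "value": {"handle": "0.NA/20.500.14132", "index": 200}}}
  ]
}
"""
record = json.loads(record_json)

# The client's whole job: pick out the type-value pairs it understands
# (here, URL) and ignore the rest (here, the HS_ADMIN administrative value).
url_values = [v["data"]["value"] for v in record["values"] if v["type"] == "URL"]
print(url_values[0])
```

Note how deterministic this is compared with scraping a landing page: the structure is fixed by the resolution protocol, so an unknown type is simply skipped rather than breaking the client.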
Christophe: And so, you're not sure what you're going to get, because the default... What I would say is that HTTP is really geared at humans, and at humans being able to figure out what this is. But when you're talking about machines, this is where this ability of humans to figure out what it is that they're looking at escapes what most machines are set up to do.
Christophe: It's not that, in some distant future, you cannot have some AI system figure out what we're doing. It's just that it's an overhead that is not necessary and can become really impossible to manage at scale. And especially imagine if your landing page were actually a dereference page, like a DOI's. In some cases you would just put some string somewhere that says, "oh, this is a landing page; now go get that other URL to figure out where the data is." So what's the analysis? Is any link a dereference? How do you figure out the context? Whereas in the handle system, the context is much easier to grasp. That doesn't mean it cannot be complex. But if you take the case of a DOI, it's super simple: you get the handle record and you look for URL, and that's your redirect. Now, you know, DOIs are a very straightforward use of the handle system. It's a straight redirect.
Christophe: There are other applications where it's much more complicated. And if you think about the FAIR approach for a while, the identifier record – if you use the handle system to implement FAIR – would have references to the metadata for the object as well as its information, and could have other information, such as a hash of the data values, so that you can verify that the data is still the same thing as what you're pointing to. So you can add a lot of complexity there. But to understand that this is a FAIR object, you need very few values to extract: What's the FDO type (the FAIR digital object type)? Where's the FAIR Digital Object metadata? Where's the FAIR object data? And that little aggregation piece provides you with a few things.
a) The ability to identify the data (the digital object itself).
b) The ability to figure out where the metadata for that digital object is, and to get to it, which helps you figure out whether this is what you're looking for.
c) And, finally, where the digital object itself is.
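[Editor's note: a minimal sketch of the kind of identifier record Christophe describes. The field names (FDO_TYPE, FDO_METADATA, FDO_DATA, DATA_HASH) and the PIDs are invented for illustration – the FAIR community is still working out the actual minimal requirements, as he notes next – but the hash-verification idea is shown concretely.]

```python
import hashlib

# Pretend these are the bytes the FDO_DATA reference would resolve to.
data_bytes = b"example dataset contents"

# A hypothetical FDO-style record: one entry per question above.
fdo_record = {
    "FDO_TYPE": "21.T11966/example-type",       # (a) what kind of object: a type PID (made up)
    "FDO_METADATA": "hdl:20.500.14132/meta1",   # (b) where the metadata lives (made up)
    "FDO_DATA": "https://example.org/data/1",   # (c) where the bits live (made up)
    "DATA_HASH": "sha256:" + hashlib.sha256(data_bytes).hexdigest(),
}

# A client can re-fetch the data and check it still matches the record.
recomputed = "sha256:" + hashlib.sha256(data_bytes).hexdigest()
print(recomputed == fdo_record["DATA_HASH"])
```

The hash makes the identifier-to-data relationship verifiable: if the bits behind the data reference change, the check fails and the client knows the record no longer describes what it points to.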
Christophe: Now, the devil's in the details, and part of what the FAIR community is doing is figuring out the minimal requirements that we need to put in place to define these relationships. But that was a side note on the FAIR digital object thing; going back to the handle system: I think the single-mindedness of the protocol is an important thing. And I think this is a distinct concept when you think about using identifiers or other identifier systems. The resolution of identifiers is something that needs to happen very consistently. If there's guesswork involved in resolving an identifier into this dereferencing information, you're making the task of the client much more complex and much less deterministic, to the detriment of the machine-readability of the whole thing.
Christophe: And I would really like to stress that it's not that you do not have other identifier systems out there; it's just that they're not necessarily as robust from a model standpoint. And I think this is the thing – I always say, you know, use the identifiers that you like. For me, an identifier is just a string; I don't really care what system you use. Semantics has proven to be a sort of love-hate relationship: once you're married to it, it can cause you problems. So I tend to like identifiers that are not too semantically loaded, for all sorts of reasons. Be that as it may, for me, the important part is: "so you have this identifier – how do you resolve it?" And if you look at the web and the typical URL, you have this mix: the identifier is the same as the retrieval mechanism. I mean, clearly, not everybody has a query string in their URLs, but a lot of them do. And then you're stuck with that particular retrieval mechanism forever. And it's not that you can't maintain it, but in a hundred years, are you still going to be managing your stuff with the same query mechanism? Maybe, maybe not. And if you do, maybe you'll regret the decisions you took a hundred years ago. And that's one last thing I'd like to say about the handle system.
Christophe: People like to say that the first 50 years of the internet will disappear. Maybe, maybe not. Some parts will be remembered. But if you're trying to build science and – I would say – records, especially for societal applications, you need to think about the next 50 to a hundred years. You need to try to find ways to do it that don't break every 10 years or every 20 years. And I think those are questions that not many people ask themselves, partly because we don't have good solutions. But if I had to say something about the digital object architecture as a whole, that's what it's trying to achieve. It's trying to ask the questions: "What is needed from a technology standpoint?" "What are the minimal concepts that remain true regardless of where we're going?" And I think this notion of resolution of identifiers... that will never get old. Now, of course, if you have – I don't know – entangled particles, entangled photons... something that can somehow lead from your identifier to the real thing, who knows? But we're not there. And I think dereferencing is an important concept. I would say one of my mottos is "when in doubt, dereference," because dereferencing means that there's an extra dimension between the two things that you're trying to connect. And that thing is a variable. And when you have one of these variables in between two points, you'd better have a way to deal with it that doesn't involve either one of the points. And I think this is what the handle system provides at an abstract level. It's a dereferencing layer, where you're not sure how the identifier is related to what it identifies. And having this dereference allows you to add more smarts or more layers to deal with this bridging over time.
Donny: Cool! I want to summarize a bit, and maybe you can debug my analogies or thinking – I'm trying to place where this is. One overall theme that I'm hearing a lot from you is that there is a technological basis, but there's still a lot of culture and discipline involved. Like, even if you have the simple mechanism of key-value pairs, do you have the culture and discipline of having meaningful keys, and those keys having handles, maybe, and then that sort of semantics? And I appreciate the idea of having a really small conceptual surface area. Like you mentioned, going back to Kahn – like the hourglass model that characterizes the internet. And just the idea that there's not so much to be ambiguous about. One of the things that I recognize, at least nowadays, about the HTTP/DNS stuff is that there's so much tooling that's become so widespread that a lot of people might feel like it's simple, because they can just pip install something and all the libraries do the same thing.
Donny: But, as you mentioned, there really are a lot of choices, and those may not remain consistent across communities over decades. So, to have something very simple at the model level does have some value, but I can see a lot of people confused in practice because there's been a lot of interleaving. Like, someone can go to purl.org or w3id.org and throw an Apache config in a GitHub repo for their reserved name and have a permanent ID, and that's all built on DNS, but they don't need to have the DNS or the HTTPS certificate, or run the server... But all that stuff is still happening underneath. And I also see something similar with this key-value pairing; attempted standardization with HTTP headers. You might make an HTTP HEAD request and get key-value pairs back, and certain of them are supposed to be standardized. Like, there's the Digest header, which should be a hash of your content; there's the Content-Length header, which should be the length of your content in bytes, and it should be very standard; and the mechanism for a custom thing is the convention where you say "X-something" rather than having a handle as your key.
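(Editor's note: a small sketch of the header convention described here – standardized keys like Content-Length and Digest versus the legacy "X-" prefix for custom keys. The custom header name and the values are hypothetical; the standard-header list is deliberately tiny.)

```python
# Sketch: HTTP response headers as key-value pairs, some standardized,
# some custom by "X-" convention. Values below are illustrative only.

STANDARD_HEADERS = {"content-length", "content-type", "digest", "etag"}

def classify_header(name: str) -> str:
    """Rough triage of a header key: standard, X- custom, or unregistered."""
    n = name.lower()
    if n in STANDARD_HEADERS:
        return "standard"
    if n.startswith("x-"):
        return "custom (X- convention)"
    return "unregistered"

# Hypothetical headers from a HEAD request:
head_response = {
    "Content-Length": "2048",        # body size in bytes (standardized meaning)
    "Digest": "sha-256=...",         # hash of the representation (elided)
    "X-Handle-Type": "dataset",      # made-up custom header
}

for name in head_response:
    print(name, "->", classify_header(name))
```

The contrast Donny draws: the "X-" string carries no resolvable meaning, whereas a handle used as a key could be dereferenced to find out what the key means.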
Donny: Maybe you could comment briefly on the idea of the semantic web and what keeps popping up as this idea... I want to get something back that's key-value pairs where I can pick the keys that I want and know exactly what they are, and I see a lot of development about this with, say, Google's schema.org and returning JSON-LD responses, where you're gonna get this JSON response of key-value pairs where the keys ultimately should be dereferenceable. It's interesting that, as communities have a need for more ideas of persistence, they're kind of tug-of-warring in two different worlds to try and make it happen. And yeah, it's fascinating to me.
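(Editor's note: a sketch of the JSON-LD idea mentioned here – plain-looking keys that, via the @context, stand for dereferenceable identifiers. A real JSON-LD processor does full context expansion; the naive concatenation below only illustrates the idea. The dataset itself is made up.)

```python
import json

# A JSON-LD-style document: the @context makes each short key shorthand
# for an IRI (a schema.org term here) that you could dereference.
doc = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example measurements",          # invented example data
    "identifier": "https://hdl.handle.net/20.500.99999/example",
}

# Naive "expansion": map each non-keyword key to its full IRI.
context_base = doc["@context"].rstrip("/")
expanded_keys = {k: f"{context_base}/{k}" for k in doc if not k.startswith("@")}

print(json.dumps(expanded_keys, indent=2))
```

So "name" is not just a string the producer hopes you interpret correctly; it resolves to https://schema.org/name, where its meaning is documented.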
Donny: So, I'd love your general comment on that. I know we're wrapping up on time, you know, your response to that... But also wrapping up on, kind of, the future as you see it of the handle system, its intent for leadership in the persistent identifier space, and its ongoing relation with other resolution mechanisms. If you could possibly condense it for researchers?
Christophe: I think this gets to the crux of some of these issues of how you move forward. To address your point about the availability of tools to make complex things really easy for people to do – that's absolutely true. And I think if I had to fault the handle system, it's the lack of available tools and the lack of available integrated use patterns. I think there are reasons for that, but I think the community has not been as good as we would've expected by now in developing open access to tools that make your life easier.
Christophe: And I think this is something that the DONA Foundation really takes to heart, because it takes effort, but it also takes funding; and those things are driven by use. And what I could say is that the DOI Foundation – with DOIs – has been able to do a lot with fairly light resources, especially in the academic community. A lot happens on very small shoestring budgets. I think when you get industry tied onto some of these activities, then it takes a different nature. And I'd also like to say, in passing, that the Chinese are the biggest users of the handle system – potentially by orders of magnitude – but we wouldn't know it here because we don't see what happens in China. But in China, they are using the handle system to do tracing and tracking of pretty much everything and anything under the sun. And they use the handle system as a way to program sensors and read from sensors. So they're sort of the leaders when it comes to thinking outside the box and using the technology. They would potentially have more tooling, but we don't see it. And this is too bad, because the things that they're doing in China are fascinating. But the rest of the world doesn't know.
Christophe: Going back to this issue of the similarities between HTTP headers and these typed pairs. I totally agree. And I would sometimes say that the difference between the web and, let's say, the digital object architecture goes back to this notion of a globally unique identifier that resolves to a structure. Because if you take, for instance, the famous file size in your header, everybody knows that the file size should be the size of the stuff that's below. But if I wanted to create something else... Good luck, because you could always create X- types and things of that nature, like the MIME types that you would use. But your chance of having the world agree that that string means this is pretty much zero.
Christophe: And this is a difference – I would say – between the handle system and the digital object architecture, and the web: when you know that your identifier is globally unique and you use it to specify something, immediately that thing has global relevance. Whether people have seen it or not is something else.
Christophe: Whether the structures that you've put in that record associated with the identifier speak to other communities is, again, another problem. But this is the same whether you're in the semantic web on the web or the handle system. How do you convey to people that there's a particular way to describe something... This is where the social engineering comes into play. And what do these people have to gain? Do they have an existing base of tools and use cases and deployed infrastructure that can leverage it? We all suffer from the same thing, so I think your point is totally valid. There are not that many differences in what the handle system record returns if you compare a URL HEAD request and a handle resolution. You could say, "it quacks like a duck, it looks like a duck, it must be a duck." It's the same thing. And you could say, "yeah, it is." It is, from a pattern standpoint, the same. The difference is in the names that you put in these things in the headers.
Christophe: Again, the community is not saying, well, we're gonna put URIs to identify these strings. Whereas in the handle system, you'd be more, like, "oh, okay, I'm gonna put a handle to identify this type; I'm gonna make it resolvable." But even then, more work needs to be done.
Christophe: Tying this back to FAIR. The reason why I think the DONA Foundation is really involved with FAIR is that FAIR is taking up the digital object architecture. And I'm not saying they're taking the handle system or DOIP or these standards per se – because the FAIR community wants to be open to the semantic web and other technologies. But the concepts remain the same. And what FAIR is trying to do is just define enough of the semantics – just the next layer of semantics – that allows you to start asking the question, "so what is this object?" Because the digital object doesn't actually tell you. It has the notion of an object type, but it doesn't actually spec out the space of object types. It's up to the communities. So FAIR just tries to say, "we're gonna define a little bit more, so that your machine can then say, 'what is this object?' and 'do I care about it?'"
Christophe: So one of the things that I tend to always argue for in the context of, for instance, the metadata, the FAIR metadata... We've seen how many years it took for the Dublin Core to become the Dublin Core. And I'm sure that half of the community in Dublin Core were not happy about the final result. But what we have to say is: this particular metadata type – so, let's say you wanted to use Dublin Core – well, when you identify this in the FAIR DO, you have to say this metadata is actually an instance of a FAIR digital object metadata type. So you could have multiple types of metadata. But they would all be consistent with this notion that the FAIR digital object endorses a collection of types. And just by saying that maybe one or two metadata schemas are recognized by the FAIR digital object community as being acceptable, it allows you to add a fourth or a fifth. Now, this is, of course, a little bit dangerous, because you could have silos of metadata. But if you have enough tooling and you know that there are, maybe, two main metadata schemas for FAIR digital objects, then you can build utilities that map them a little bit more carefully, and then you get some level of interoperability – not as good as you would have otherwise. But I think this community cannot – we cannot – really rule things out, and we have to remain fairly open-ended. But clearly there's a level where, if you make the standards too "complex"...
Christophe: If you're trying to make standards too specific and keep them interoperable, the machinery that you need to put in place from the standards perspective becomes really heavy. And so, we have to pick our battles, and one – I think – potential solution to this is what I call the notion of "type genres." So you have types, but you need to categorize what types are so that you can, from a machine perspective, quickly say, "is this data or is this metadata? Oh, this is metadata. Okay, so I know." Or "is this an operation?" Or "is this a PID record?" Or "is this a profile?" So I can quickly make sense, since I know I'm not interested in profiles; I'm really just looking for the types of metadata you have. And I think, from an architectural standpoint, saying we recognize there are types, and we recognize that there are various classes of types that are broad in nature... (I'm not talking about, you know, two different types within metadata. It's really general, broad types.) I think this is one of these minimalistic, architectural-level standards that allow people to say, "how many genres of types are there in the FAIR digital object community?" And that should be, you know, maybe six or seven, but then everything should fall within those categories. But if you needed an eighth, then anybody could do it. And then the question is: "Anyone can do it, but if nobody uses it, does that matter?" No.
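(Editor's note: a hypothetical sketch of the "type genres" idea – a small, fixed set of broad categories that every type falls into, so a machine can triage a record quickly. The genre names follow the examples in the conversation, but the registry and the type identifiers are entirely invented.)

```python
from enum import Enum

# Broad "genres" of types, echoing the examples above; not a real standard.
class TypeGenre(Enum):
    DATA = "data"
    METADATA = "metadata"
    OPERATION = "operation"
    PID_RECORD = "pid-record"
    PROFILE = "profile"

# A made-up registry mapping (fictional) resolvable type identifiers to genres.
TYPE_REGISTRY = {
    "type.example/dublin-core": TypeGenre.METADATA,
    "type.example/csv-table": TypeGenre.DATA,
    "type.example/thumbnail-op": TypeGenre.OPERATION,
}

def interesting(type_id: str, wanted: TypeGenre) -> bool:
    """Triage: is this type in the genre I care about?"""
    return TYPE_REGISTRY.get(type_id) is wanted

# "I'm not interested in profiles; I'm looking for metadata types."
print(interesting("type.example/dublin-core", TypeGenre.METADATA))
print(interesting("type.example/csv-table", TypeGenre.METADATA))
```

The point of keeping the genre set tiny (six or seven entries) is that a client can dispatch on it without understanding every community-specific type.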
Christophe: And this is one of the interesting issues with typing as a whole: you could type things, and, especially in the case of the handle system, anybody can create a type anywhere and use it. And that type will be resolvable the moment you see it. But whether it gets any use depends on the solutions that this type brings the community. And this is where the dynamics of the social contracts become key again. If you're trying to solve a problem at a large scale, chances are you'll create communities that are interested in these problems and will potentially put implementations behind these concepts and make them useful. But if it's just a great concept that nobody uses, it doesn't matter. And so, I think we are all fighting that battle of relevance; of who's solving the problem best. And it's a challenge. And if we don't solve the problem best, well, we're not solving it best. And so be it. I mean, that's, you know, the great battle of ideas.
Donny: Absolutely. Great. And I love this idea of type genres. I don't know how similar it is, but it reminds me of a similar effort that tries to catch on in the semantic web RDF community. They'll have this idea of an RDFS schema or vocabulary; this idea of a class; and you'll subclass that. So you might see all of the classes, and Google Dataset Search will accept DCAT datasets, and there's this sort of idea of trying to express that, but it's still sort of on top of HTTP, and you have to know it.
Donny: That kind of idea. I just love that. And I also love you making the distinction that, with HTTP, you have content types and MIME types, and you can have response types, and you could have a response saying "this is JSON", but that doesn't actually tell you anything.
Christophe: Yeah. It's funny. The thing I find always amazing is how far we've gotten, and how duct-tape-ish a lot of this stuff is. And it's actually remarkable. And I think part of the... If somebody says, "if the handle system is so good, how come more people are not using it?", I think my answer nowadays is just that people have found solutions that maybe are incomplete, but that serve their purpose, and they're going with them. And the thing I find fascinating about FAIR is that it shows a certain maturity when it comes to data on the web. I think, finally, there's the sense that there's too much data that cannot be used, reused, accessed, or processed, because we really have no clue what it is, and there's nothing out there that really helps us. And so, it's fine when this is not costing too much money, but when it starts to cost money and/or lives, then I think you're, like, "okay, so why is it that this is not working?" And we have to sort of rebuild a foundation that says, "well, no, we can't just assume that people will figure out what these data sets are and how to get them." No, it just doesn't work, because machines have to do this, and machines maybe are not as patient as me.
Christophe: I remember an example. A colleague of mine used to work for Marriott back in the US – Marriott, East Coast – and they wanted to do some analysis based on customer data – figure out some patterns – and they had data from Marriott West. It took them six months just to figure out how this data was structured. And this was within the same organization; six months of wasted time with a fair-sized team behind it. So, you know, we can't keep going on this way. And so we need something to structure this information in a way that's a little bit more conducive to good results.
Christophe: The other thing I would say is that Google, Amazon – all these groups, fantastic products, fantastic services – but especially when you take a European-centric perspective... If you wanted these clouds to interoperate, they're not gonna interoperate just because they're good guys. No, they're gonna interoperate because there's money to be made. And right now, exchanging data from one cloud to the other is not where the money is coming from. But if I have pictures on my Apple account and I want to move them to AWS... Good luck. You're gonna have to copy them locally and then switch them over. It's like the good old days of phone monopolies. You could not change your phone company until the mid eighties in the US, and then Congress had to say, "no, you have to have portability of numbers across companies and services." We need to have the same sort of thing for data and, ultimately, services and devices, because those things aren't the same.
Christophe: I think this is where open standards that are not necessarily controlled by a couple of large organizations could make a difference. So, it would be great if, you know, the Facebooks, the Googles, the GAFAs of the world, saw that there was some validity in coming up with a way to interchange data between their clouds. So, at first you start with data, but maybe it could become services after a while, or IoT devices. And I think those are larger, complex questions.
Christophe: And just to open up the topic a little bit more, I think, from a societal standpoint, people – most people – do not know what their digital rights are; or, even worse, what they could be. So how do you socialize a notion that maybe you do have the rights to not have Amazon keep all of your financial data? Maybe, you should be able to dissociate your card information from your account at Amazon, and maybe there's a law that says that, but people are not thinking along those lines and they have no clue and they wouldn't even know what questions to ask.
Christophe: And I think – not that I want to make FAIR more than it is – but FAIR, I think, asks these questions with it. Okay, so now you have FAIR digital objects. Can I FAIR-Digital-Object my persona on the internet? My digital information as it's controlled by my home country – can I access this on my terms? The products I own – can I have access to that data? I think those are larger questions that, if you have interoperability at some level, you can start asking, and expecting laws to influence how your data is managed. And if you had to project FAIR moving forward, I think this is some of the democratization of personal data, with the ability to quantify and inspect its use and, hopefully, prevent its abuse. But I think we're in the wild west of personal data at this point. And it's good to think that we'll have a solution at some point, but I don't see politicians asking those questions, and I don't see the population as a whole really educated enough to be able to ask for those things. And you don't need to be a technologist to know that your rights are trampled when it comes to information. But where do you start? Who do you point the finger at? It's not because Facebook generates the data that they're the ones using it... Yeah.
Donny: This is great. I'll try to wrap this up a bit – respect your time. Just bringing it back to identifiers. I love that example – I dunno if this was intentional or not – but you started with this one example of two things, with your friend at the Marriott. Okay, we've got the East Coast data, we've got the West Coast data; how do we know what's what? And now you're talking about every one of the billions of people in the world... This massive integration of data, and we're definitely gonna need to know what's what. So we're gonna need robust identification, number one, first and foremost. And I just love – I don't know if this is said somewhere... I imagine I'm not coining this, but I couldn't help but think: people talk about the internet of things and so many things, but what we're talking about is the internet of people. Rather than data governance in devices, it's data governance with people; and people have their own personal data, and that can flow around. So it's the internet of people.
Donny: So, I wanna close this up. I have two questions that I wanna ask you at the end. One very broad, which is: do you have any general advice for our listeners? And this can be anything; it doesn't have to be related to handles. It could be general life stuff, but it could also be that. And then, after that, I wanna know: who should I invite next?
Christophe: Oh, well, that's a tough one. The general advice is: stay open to technologies and try to find the ones that solve your problems the best. But ask yourself hard questions. Are you doing it because it's the simple solution or because it's the right solution? And it may be semantics to some degree, but I think, sometimes, we tend to take the expedient solution without thinking about the long-term ramifications. And with that in mind, we now have to think about the internet as a system that needs to persist. I'm not talking about the routers and all this stuff. I'm really talking about what goes on on the internet; the discussions and the data. And we really need to start asking the question: how do we make this last? Do we wanna make this last? These are difficult questions. But you have to keep asking yourself. Otherwise, 50 years of history is going to look very muddy in a thousand years. And you'll say: what happened between 1990 and 2050? We don't seem to have anything. So, you know, people are putting up a good fight to try to do this. But I would really ask this question: what technology do you think will last beyond five years, 10 years? If you're a developer, does what you're doing now have staying power? I mean, it could be totally disposable software. I'm afraid to say most software is totally disposable, but the stuff that you do with it... Are you making it so that somebody can understand it five years on? I mean, when you read code that you wrote five years ago, most of the time you say, "what? Did I write this?" I think data is even worse. So, that's the question I would have people think about: are we helping digital information last over the long term? Are we really being good stewards of this community that we are at the moment? Because we're so connected over the internet...
Christophe: But I have a dependent – some family members who are challenged in this area – and I always wonder what happens when they're no longer able to take care of their stuff. It's a daunting task. And I'm a technologist, so I have an idea of what to look for, but people who do not... This idea that you would lose your parents' pictures on their computer because you don't know where they are... you couldn't get them backed up... We're at this transition point where we still have a lot of printed pictures from our parents, but, when we are grandparents, I'm not sure what we'll do with all our family heritage. And who should do this... So, this is a very open-ended answer, I'm sorry to say. It's not really specific, but I think keeping up with technology is a full-time activity these days. And you really have to take an active part in shaping the future, because we depend on it.
Donny: I really appreciate that zeroing in on a really specific, concrete thing. People can think about their pictures. Yeah, their pictures. It's just very concrete. Thank you.
Christophe: And as far as inviting someone. Well, in the FAIR community, there are definitely many interesting actors. There are a lot of people within the DOI Foundation community that I think you may want to talk to, because they have a wide body of data and a fairly wide community that's expanding in nature. So, DOIs – they're typically used to identify scientific journal articles, but nowadays they're identifying movies, they're identifying building parts, IoT is next... And this opens up a large, interesting space. So I would suggest that you maybe invite the DOI Foundation executive director for a chat.
Donny: Okay. I can reach out to them and say, Christophe recommended you.
Christophe: Yes. Yes, certainly. But I think that would be an interesting... From a technologist to something that is really socially active – an active social community that is probably going to be one of the front lines on the FAIR beachhead, if you will; because they will probably have to see how FAIR can be applied to all of the data that they reference. So, I would really think that that would be an interesting podcast. Okay. Thank you.
Donny: Thank you, Christophe. Alright folks, that's it for today. I'm Donny Winston and I hope you join me again next time for Machine-Centric Science.