Machine-Centric Science | Transcript: Sandra Gesing

February 17, 2023 • 41 Minutes

Sandra Gesing

(This transcript was auto-generated and then outsourced to edit for clarity, but I have not fully reviewed it. If you notice errors / confusing bits, please let me know: donny@polyneme.xyz)

Donny Winston:
Hello, and welcome to Machine-Centric Science. My name is Donny Winston and I'm here to talk about the FAIR principles and practice for scientists who want to compound their impacts, not their errors. Today we're joined by special guest, Sandra Gesing. I know Sandra through a Research Data Alliance (RDA) working group, where I've briefly been getting to know her better, and I'm very excited to have her on the show today. Sandra, welcome. And could you please tell us a little bit about why I'm so excited to have you on the show today?

Sandra Gesing:
Oh, thank you for your kind invitation. This is really exciting. And yeah, I hope I can show that it's also exciting for the listeners and for you to have me on the show. So I'm Sandra Gesing. I am a senior research scientist and the lead at the Discovery Partners Institute at the University of Illinois Chicago, in the University of Illinois System. And yes, so my research is about science gateway-based computational workflows and distributed computing. And science gateways are a concept, which is research software to make it easier for scientists and educators to really concentrate on their work, on their research questions, instead of getting into nitty-gritty details of all the technologies. And then we come really to FAIR.
So for everyone who doesn't know what FAIR is, it's findable, accessible, interoperable and reusable. So there is really a big hype around the FAIR principles, rightfully so. And it started with digital objects, and first everyone looked at data and now we look also at software and workflows, and that makes it a little bit complex, because there are other rules for software than for data for the different principles. Software has other characteristics, but some are also overlapping, like having a DOI, for example. But interoperable software is different than interoperable data, for example. So I think that having these FAIR principles for software and workflows, and really looking at the different characteristics specifically for these digital objects, is really important.

Donny Winston:
Yeah, I definitely agree. I'd also like to highlight for people, I'm also excited... A fun fact: I know Illinois, that area, is sort of the birthplace of the original Mosaic web browser. So that was kind of the original gateway to the web of documents. So it's very appropriate that you're working on the gateway to more advanced semantic stuff for science. So with FAIR data, I'm used to the idea of data as static, or I can pretend that it's static. I can ask for a data set and I can get that and I can hold onto it and I can maybe get the SHA-256 checksum of that bit sequence and okay, I've got this data set. Software is a little more elusive, because oh, I can have this version of the software but maybe there are enhancements or bugs. So I actually really definitely want to keep track of updates to it.
I'd probably want that for a data set too, but even more so for software if there's some logical issue, and then for a virtual research environment, for a gateway, even more so. How can I actually really know what version of what is there? So I feel like gateways are a real stress test of the FAIR principles and getting that in. So I'm curious what you have seen in your experience, I know you're involved with some major gateway initiatives, in terms of applying the FAIR principles, which maybe seem more naturally suited for data, to workflows. So maybe as part of your work with the RDA working groups, but also just your professional work at your gateway trying to serve the scientists: what are some of the challenges that you've seen, and how have you chosen to address them, and maybe how is that going?
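The checksum idea mentioned above can be illustrated with a minimal Python sketch. It assumes the data set is a single file on disk; the function name `sha256_of_file` and the chunk size are hypothetical choices for illustration, not something from the conversation.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks (1 MiB by default, an arbitrary choice)
    so that large data sets do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two people who hold the same bit sequence get the same digest, so publishing the digest next to a data set lets anyone check that the copy they hold is the one that was described.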

Sandra Gesing:
Yeah, so I totally agree that science gateways and virtual research environments stress it all, because they have these layers: there's a software framework that provides the environment, and then in an instance of a virtual research environment you have workflows and data and tools and all these different things. And what I've seen with workflows, not only in science gateways but in general, is even though they're made to really make research and computational methods reproducible, because that's the idea. You want a workflow really like okay, I can give different input data into that, I can... So that groups can reproduce results, or produce additional results, with the right scientific methods.
And so even that should be... But being really reproducible is still a big topic also in workflows. It sounds so easy to say, oh, if you have this workflow then it should always produce the same results. That doesn't happen, because a workflow consists of data and tools in a specific order, and data changes and tools change. There was, for example, a publication a couple of years ago on Taverna, which is a science gateway with workflows. It looked into whether results are reproducible between the different Taverna instances, and they were not, because a tool might just create a different result because it's on a different operating system, with maybe only minor changes. But even between instances of the same workflow tools, it was not always the same.
It was not easy to reproduce. And so we have to be very careful there, and one solution is, for example, to have workflows in containers, in Docker containers, so that you have the whole environment that doesn't change, or you really work with some tools like Repro, which also pack the whole environment so that you can ship it, unpack it, and start again with exactly the same environment every time. So that is really why, but reproducibility is really hard to do. It is not easy with workflows, even if the logic is there. So on a high level, theoretical level you have this, oh yeah, we know what we are doing there. But then you come to the practical level with different versions of tools, with maybe a different operating system, and then it gets... It comes again to that moment where not everything is reproducible.
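One way to picture the point about tool versions and operating systems is to record the environment next to each workflow run, so that two runs with different results can be compared. The following is only a sketch under assumptions: the function names, the record fields, and the example tool names are hypothetical, and a real workflow system or container image would capture far more detail.

```python
import hashlib
import json
import platform
import sys


def environment_record(tool_versions):
    """Describe the environment a workflow ran in.

    `tool_versions` is a hypothetical mapping such as {"aligner": "2.1.0"}.
    """
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "python": sys.version.split()[0],
        "tools": dict(sorted(tool_versions.items())),
    }


def fingerprint(record):
    """Hash a canonical JSON form of the record into one comparable string."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def differing_fields(record_a, record_b):
    """List the top-level fields in which two environment records differ."""
    keys = set(record_a) | set(record_b)
    return sorted(k for k in keys if record_a.get(k) != record_b.get(k))
```

If two runs of the same workflow disagree, comparing their fingerprints and then calling `differing_fields` gives a first hint about whether the environment, rather than the logic, changed between them.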

Donny Winston:
Okay. Yeah, thank you. I like that idea of... Yeah, it seems like the idea of containerization and Docker's work has taken the world by storm, in terms of a way of, I guess, an equivalent for data and software, to essentially have a bit sequence: here's a file, a Docker image, that should have everything that I need. Stepping back a little bit in terms of the motivations and the need for gateways, I'm thinking for example if I want to reproduce some big analysis that involves lots of data and software, maybe theoretically I could produce a Docker image that contains everything that I need, but that Docker image might be a petabyte, if it really actually contains everything, and then I need to run it.
And so, maybe I don't want to just use the couple cores on my laptop and the meager RAM that I have and I want to mount volumes for intermediate results. So I maybe need a big hard... So even if I could reproduce it, I can see the motivation of just having these gateways having access to these HPC resources and clouds, et cetera. And so I imagine that complicates the ability of someone to truly reproduce something, because they might not necessarily be able to on their own hardware. So I'm curious, what are some alternatives for people that you see in terms of having trust in the reusability and reproducibility and fidelity of these workflow runs in situations when they probably can't feasibly reproduce the whole thing themselves, even if it's technically available, just because it costs money and facilities?

Sandra Gesing:
Absolutely, absolutely. So there is really also another group working a little bit on this translational science concept. How can you really prove, between software, workflows, and the theory, that the software and workflows do the right thing? That is still really a big topic. So we met two years ago, two and a half already, the pandemic. So it was a face-to-face meeting, so it has to be two and a half years. Exactly. So we looked into how can we really prove? So if we have the experts who understand the science and say, okay, I really need these simulation tools or these workflows to do exactly this, what this concept expects to solve, how can we prove, even if someone follows the instructions exactly and also understands the theory, that the software does what it's expected to do? How do we prove that the results are really what you want to see?
It's still a hard problem. So that is really... So therefore I've seen often in science that people really refer to peers who say, okay, this result makes sense. And then they really refer to peers they trust, or to papers where they say, okay, that has been peer reviewed, I can trust this. And the other thing is, as you said, petabytes. So the other thing is really also that you cannot always run everything again. I work with molecular simulation, and some runs go for three months. You don't do that again, because of course there are checkpoints in between and all these things, but you don't run that again just to prove that it's really reproducible. Nobody has time for that, nobody has the resources for that.
So I think what we do, or what I've seen, is that you first try it with a little test data set to see whether the results make sense, and you make a peer review out of it, or you work with colleagues who are also experts on that, and you work together to see, especially with new tools, does that really make sense? And that is how it goes. And so reproducible... So we asked this question in one of the RDA interest groups, the virtual research environment interest group, how people develop trust in methods, because you cannot be... Especially in a science gateway, on purpose, a lot of things are hidden to make it easier, so that you don't deal with the technology. But that also means you have to trust it. And a lot of researchers said: I trust the handful of colleagues around me who have the same expertise or who I work a lot with; if they say it's okay, I'm quite sure that I can use it and that it's okay. So it comes to the human factor, that is what I have often seen, or heard.

Donny Winston:
And trying to make that human... I'm curious about efforts to make that human factor, to translate it to the FAIR principles as much as possible. So there are lots of people, trust is a huge topic. And I guess when it comes to FAIR data or even small-scale software, you don't necessarily have to trust the result. If you trust the data and the software integrity, then you can run it yourself and you can trust the result that you got on your computers. But it seems like with these gateway things, yeah, you need to trust that someone else ran it correctly because you don't control the hardware.
And I know there have been lots of initiatives in the digital space for many years around trust. A lot of them rely on cryptography, the idea of digital signatures, and just things where you can distribute trust over networks in ways that are FAIR. And so it seems like this issue of verifying trust might be particularly an issue with workflows. I'm curious about your experience with that. Is it still pretty much all "talk to someone and ask them, and someone emails you and then that's good," or has there been any use of digital trust mechanisms, or do you see that in the future, towards providing this necessary means of people being able to reuse without having to reproduce?
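To make the idea of attaching a verifiable tag to a workflow result a bit more concrete, here is a small Python sketch. Note the swapped-in technique: real digital signatures use public-key cryptography, which needs a third-party library, so this sketch uses an HMAC with a shared secret from the standard library as a simpler stand-in. The record fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def tag_run_record(record, secret_key):
    """Compute an HMAC-SHA256 tag over a canonical JSON form of a run record.

    This is a shared-secret stand-in for a real digital signature: anyone
    holding `secret_key` can both create and verify tags.
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()


def verify_run_record(record, tag, secret_key):
    """Return True if `tag` matches the record under `secret_key`."""
    expected = tag_run_record(record, secret_key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, tag)
```

A gateway operator could publish a tag alongside each result; someone who trusts the operator's key can then check that a result record was not altered after the run, without rerunning the workflow.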

Sandra Gesing:
Yes. So there are a couple of things. For example, Galaxy is a big science gateway with these workflows. And they have an instance that is publicly available. So you can get a user account; you cannot change anything in the workflow itself, but you can put in your own data and run the workflows. And I know the team behind that and how they really have the expertise for the workflows that they have. So for example, I trust workflows in Galaxy, in the official instance, that it really works. So it could be something like that. The whole team is behind that. So I don't need to know them directly, but I know that they've been in the space for years and they have millions of users, and if there were really some problem, they would solve it. So that could be one.
It's kind of a human factor, but also I don't have to know them personally. I just know really the reputation and that it's so widely used because it's publicly available. People are very careful about that and it's really that they're sharing it. So I think that is part of the trust, that it's... If you make it publicly available like that to users and there's a big community, if somebody found an error in a workflow or something, they would say so. The community I think is really very, very welcoming in that space. So the science gateway community in cyberinfrastructure, I must really say, is very welcoming, always focused on making it a better place and not trying to... Really trying to solve problems and not trying to be hostile to each other, let's say it like this. So I really love being in that community.
It's a very, very good community for collaboration. But the other thing, of course: there are really interesting trends now to look into peer-to-peer frameworks, for example. So that, really, for data you transfer, the security is not centralized... But that oh, if it comes from that peer and that peer signed that, I can trust it. So that it's really that peers are signing up in a network with each other. And we talked also about blockchain technologies to integrate that in security and do these signatures. So I think there are really interesting trends at the moment beyond authentication, also for access to data and who can really change things and who's in a project. So I think besides the normal security or authentication and authorization mechanisms, there are really interesting trends with peer-to-peer and blockchain.

Donny Winston:
Cool. Yeah. Sandra, I really like how you... I brought it in with a view or perspective or focus, whatever, on the author or contributor, and whether we are going to trust by the person who signed it, or the agent. But you brought in a great point, which you got at with the Galaxy community, and I actually got this as part of my work with the Materials Project, a computational materials science project at Berkeley: having an open shared platform and open code bases. So for example, we were prepared to accept simulation results that were performed by other groups, if they used our open source Atomate workflow library based on FireWorks, just names of software that we're using. So in addition to like, oh yes, I trust this person, we also mitigate the risk of it being incompatible by, well, they're also using the same software we're using. It's possible that they've misconfigured the environment or whatever, but we're getting more trust because we're using the same thing. And of course, that mode wouldn't have been available if we didn't have that source code available.
And also probably if users of the code base didn't feel like they could contribute to it. So not just open source in a license sense, but open source in a development sense. It's the idea of pull requests and that sort of thing that GitHub has popularized and socialized, where you collaboratively develop. And I imagine that that's a big part of Galaxy, as you mentioned, where you might not trust a particular person or know a particular person, but they've bought into this ecosystem and so there's some trust inherent in that. So I don't know if I'm off base here, but I'd like you to maybe comment on whether the ideas of free software, open source software and collaborative development really play an important role in the verification of gateways and VREs by providing that trust base of the methodology. Yeah, I'm curious if... Have you seen closed source, closed gateways work? Do they all just tend to be open by default? Are there reasons why they keep being open? Yeah, I'm curious about your experience with that, having directed and been a big part of these gateway initiatives.

Sandra Gesing:
So I think most of it is really open source, and the reason is partially that there are academic or national lab teams behind it and they share it, they share their software, and they would also like their partners to use the software, of course. That is part of having the bigger community, and with science gateways especially, often people would like to share their [inaudible 00:21:57]. So years ago we did a survey, while I was part of the molecular simulation science gateway initiative and project, and still in Germany. And we got results from around 200 people, where we asked, okay, would you share your tools, your workflows, your data, or your results? And we asked it in three questions. So what they did: 80% said they would share their tools and workflows, and 80% is a high number. But yeah, molecular simulation, that is also drug design.
And then it comes also together. There's really, let's say, money in there... Because it's very expensive to develop drugs, and it takes 10 to 15 years to really get from the first idea, or the first simulations looking at the substances and everything that has to be tested, until something really comes to the market. So if you ask this group, 80% also share workflows and tools, but there would be... They share the data after it's patented or after they have published it. Because the cost of developing successful drugs is just so high that people feel like they have to protect the data first. And I can see that. And for other initiatives, the goal is to share data. It's like... So that is exactly what they want. They want to share the data, they want to make it FAIR, they want people to know about it. So for example, survey data: often you can find the raw data nowadays, for example from the research software engineers.
There's a yearly survey on how the landscape is internationally, with surveys in every country every year, or maybe now every two years, I don't want to say something wrong. But the raw data is always directly shared when the analytics are also put out on what the new numbers are, because people want other people to be able to find it, to work with it, and so that there's outreach for research software engineers. Because what we want in US-RSE, for example, the US Research Software Engineer Association, is to improve career paths for research software engineers. That they have a place in academia and national labs, that they have real career paths, that they have an incentive to stay in academia. And because yeah, we cannot pay the Google salaries of the world in academia, but normally we get people in the space because of the working culture, because of interesting challenges and problems to work with. And you can be part of initiatives, then you are not in research. But I got a little bit off track, I think.

Donny Winston:
No, no, that's great. It's fascinating to me what you said, thank you, about this poll where it seems like most of the people in this survey are very okay and positive about sharing tools and workflows, the software and I guess the configuration that will be run by the gateways, that is the workflows. But in terms of the actual data, that's like, well, maybe post publication, after it's patented, but maybe never if it's a trade secret. So what's interesting to me about that is it feels almost like the software and workflow spaces might see less friction more broadly to verification than data, just because they're maybe more likely to have open eyes on them. FAIR isn't the same as open, but maybe these communities are more likely to have eyes on it, so that people interested in verifying things can find these things because they're open. And so that actually might make more progress and actually lead the way for data. So I find that interesting.
One thing I want to ask you about as well how this factors in, for the past decade-plus we've seen a rise of software and workflow that's partially driven by data, like the whole paradigm of machine learning where you actually have data and that data is the stuff that's going to train your weights and parameters that are going to be fed into your software. And so someone who wants to use the software will be like, well, where did you get all these weights? Can I reproduce this? Can I see the training data? And it's like, well no, that's the data, but no, it's the software. But no, this is the data part of the software. So I'm curious how that's playing out or how you see that playing out. Is there any tension in terms of software workflow that you see that's maybe data-driven and there's sort of acute issues with reproducibility because the data isn't available and that's part of the software. Did you see any tensions with that? I'm just very, very open-ended just thinking about this.

Sandra Gesing:
I totally... Yeah. So I think as you also said, workflows and tools are the one thing where people are like, oh, I can share that, because the data is what is really driving it and makes different results possible. With some tools, as you also said, with simulations, the data is created, so there's only a little bit of configuration necessary. And then the simulation really creates the results. So I think there might be some friction really more where data is in the medical space, or where you really have to be careful because you have to anonymize data before you can publish it, even if it's not about I want to have a patent first, but more also like, oh, you could find out what person that is.
Or you have to be careful and I think that is a friction, but it doesn't need to be a real friction because I think with FAIR at this point where we know it's guiding principles. So if there's a good reason not to share, then it's okay. Then you can still say, yeah, we can not fulfill exactly everything here, but because of a reason. I think as long as people just put the reason and as a comment that is fine, then you can still say for this data set, it's totally fine. It's FAIR in that way that we can not make it more FAIR... Sorry. So I think as long as there are comments for that and there is a reasoning.
For example, Schrödinger is fantastic software in the drug design space. They don't share their code base, and that is okay. I totally understand that the company earns their money with that. But they make a lot of things available also via gateways and everything and for teaching. So it's freely available, or at universities almost free. I don't know the license model at the moment, it's a while ago that I looked into that. But they have really fantastic software, so that means they make it available for certain purposes. And in that regard, I would even consider it FAIR then, because yes, they don't have it open source, but they have a good reason. That's their business. So I think it doesn't have to be a friction as long as there is good reasoning behind it.

Donny Winston:
Yeah, great. And thank you for bringing the other way. That was very one-sided of me to think, oh yeah, software workflows are open, but it's like, no, no, no. Look at Schrödinger. In the computational materials science world that I'm familiar with, there is a very popular [inaudible 00:31:10] simulation package, VASP, that does DFT and that's closed source. And yeah, I don't think FAIR is necessarily incompatible with closed. I guess the idea there is to have nice [inaudible 00:31:24] of, well, we used this version of the software, and maybe even the checksum of the code, which doesn't reveal the code, but if you have the code, you know whether or not you have the code that we had, so there's that. Pulling back a little bit, actually just related: for a gateway user, I know a big benefit is like, well, we really can't do this on our own systems, but I'd imagine there's a tension with, well, we also don't want our data to leak because we want to keep it to ourselves, so do I want to upload it to this gateway?
All of that stuff. And so, I'm curious what kinds of things you have with that. I know there's the FAIR paper you have, Barend Mons has said things about data visiting and the idea that maybe the gateways come to the data or essentially. We have the data on our premises, but all the logic and workflows can come to us rather than us moving the data around. And one thing is just practically, there's just a lot of data, but also maybe I don't want to send my data to you. I'm sorry. I'm curious, have you experienced any of this hybrid approach where maybe a gateway comes to user infrastructure or just alternatively that theme of how gateway can technically, but maybe also socially ensure a mitigated risk of data leakage and stuff so that people can feel comfortable using this gateway and not like, no, I need to buy my own cluster because no, I can't send to the gateway. So just in terms of that adaptability, while keeping things FAIR, while ensuring that things aren't as open as people don't want them open.

Sandra Gesing:
So at the beginning also, when cloud came out, some of the researchers I talked to didn't trust the cloud. It was like this: no, I don't know what happens there and I don't know what is behind it, and I totally get it. The trust needs to be earned first, and I get that. So there are two points, two sides also to it. One is trust, but the other thing is, as you said, there might really be a petabyte of data. It's just really cheaper to ship maybe a container with the tools to the data than trying to get the data over to a gateway. So I have seen approaches, for example, from David Abramson in Australia; he got an innovation award for that. So what he would do: he would use containers, and each container would have a Galaxy instance installed. And what the users could do, they could start it only for themselves in the cloud. It was their instance. No one else had access to it, and they used it with their data.
And as soon as the workflow was finished, the instance would be killed by the system and they could download their data. Or maybe they kept it there, so it depended on how big it probably was or what data they needed. But that was one of the ideas, to really split up the whole science gateway to where the data is, and then to kill the instance when it's not needed anymore. And every user could do it for themselves, so that approach is really very cool for people. And that was at the university, so university users trust their environments there. I think that there are a couple of approaches really, especially again with containers, that you can ship them around to where the data is. And what I've seen now with cloud, people are so used to cloud from their daily lives now that there's also more trust. And not that I would say the situation is better than at the beginning with cloud regarding security mechanisms or something.
It's just only that it's for users now. Oh, they do that with their photos. They put their photos in the cloud, and that it goes into daily life sometimes lowers the hurdle also for science or for education, because it's like okay, if I trust them with my private photos, I can trust them also with that data. So I think that helped a lot, that it's mainstream now and people are more aware of the other side. Also as you said, data is a big thing of course. And everyone knows now that it's this big data, there's so much data nobody can really analyze all the data. So really data is valuable, yes. But people also get more just like, okay yeah, there are hackers. We have to be careful and you always have that side. But you also know, okay, at the university, at national labs, at least the infrastructure is also protected, and there are people behind it who really protect my data. So I think the situation got better. Also as I said, because cloud is part of our daily lives now.

Donny Winston:
Yeah. It's been socialized. It reminds me of Brian Nosek's pyramid, where something needs infrastructure to make it possible, but then UX makes it usable, but then of course it needs to be normalized, and then eventually you can get to the point of policy where it's like, okay, this has to be FAIR. It's like, well, we have to be used to the idea. And so yeah, I definitely see the role of people in their everyday lives doing that. I also love that idea of this ephemeral Galaxy instance that comes to your data. And it makes me think of a broader trend in tech known as serverless, which isn't serverless, of course, it's just that a server spins up ephemerally and does one thing.
And so I'm imagining the serverless gateway, or the on-prem local serverless gateway. That's interesting. Yeah, this is great. Well, I want to be mindful of our time. We're at about the half-hour mark, which seems like a good time to wrap up. We've really discussed a lot. I really enjoyed talking with you a lot, Sandra. I'm curious, there are a couple of questions I try to leave my audience with from my guests. Number one, do you have any advice for our listeners? And this can be very broad, it can be very specific and pointed to the idea of FAIR for various science gateways, but it could also be very, very broad, as you alluded to with the idea of cloud photos. It could just be kind of anything, any broad advice that you'd like to share.

Sandra Gesing:
So yeah, a broad piece of advice would be, if possible, I'd love the listeners maybe to get into Open Science, if that's possible. I think it's really important to share our research results with everyone. As I said, if it's possible, we talked about the exceptions, but I think it's important also to get more equality in the world, because we get paid already for the research we are doing. So I don't like it when we have paywalls, and Third World countries do not have the money then to see the research [inaudible 00:40:15] something. So I think Open Science is very important. And yeah, get into that, if you can.

Donny Winston:
Open Science, look it up. All right, great. Thank you.

Sandra Gesing:
Thank you.

Donny Winston:
Yeah. And then second question, final question, who should I invite next on this show?

Sandra Gesing:
Oh. A couple of names come to mind, to be honest. But I think a really interesting conversation partner for this show would also be Mike Zentner, who's the director of the Center for Excellence for Science Gateways [inaudible 00:41:00]. He also has interesting takes on these topics, yes.

Donny Winston:
Cool. Thanks. Great. Thank you so much, Sandra. All right, folks, that's it for today. I'm Donny Winston, and I hope you join me again next time for Machine-Centric Science.