In this episode of Expert Insights for the Research Training Community, Dr. Susan Gregurick, computational biologist and director of the Office of Data Science Strategy, discusses the development of computers, the internet, networking, and analysis platforms, weaving in her personal journey. She then describes the role of data and computational sciences in combating COVID-19.
The original recording of this episode took place as a webinar on May 13, 2020, with NIGMS host Dr. Ming Lei. A Q&A session with webinar attendees followed Dr. Gregurick’s talk.
Recorded on May 13, 2020
View Transcript Download Recording [MP3]
Announcer:
Welcome to Expert Insights for the Research Training Community—A podcast from the National Institute of General Medical Sciences. Adapted from our webinar series, this is where the biomedical research community can connect with fellow scientists to gain valuable insights.
Dr. Ming Lei:
Good afternoon. And to those of you on the West Coast or in Alaska or Hawaii, good morning.
My name is Ming Lei. I am the division director for Research Capacity Building at NIGMS, and I am your host today. It’s a pleasure to welcome you to the fifth webinar of the NIGMS training webinar series.
This is a difficult time. The pandemic has disrupted everybody’s life to a certain extent. NIGMS created this webinar series to help keep our training community together with useful and interesting talks and conversations, and I hope you are all enjoying them.
A few reminders before we start the presentation. All webinars in this series are recorded; some have already been posted on the NIGMS website, and all of them will be, so you can view them at any time. I would encourage you to ask your friends to view them when they have time. Secondly, there will be a Q&A session after the presentation.
And our speaker today is Dr. Susan Gregurick. Because she will share her scientific journey with you, I am not going to introduce her at length, except to say that as the NIH associate director for data science, she is the leader at the center of all NIH data science activities.
So with that, Susan, take it away.
Dr. Susan Gregurick:
Thank you so much, Ming. And to all my friends at NIGMS, it’s a pleasure to be here today.
I’ve been so excited and looking forward to this particular journey and discussion with you for a week. It’s not often that I get to tell young people about my own personal journey, and I hope that you see yourself in a little bit of me and what I’ve done. I’m going to tell you about what I’ve done in the computational sciences, which is my true love, and how this has helped shape biomedicine and my own personal professional choices.
So let me begin by telling you the beginning.
So I want to give you some historical perspective on the development of computers, computer science, the internet (I actually watched the internet's birth, more or less), networking, and analysis, woven into my own personal journey, and how these have changed my professional life and helped me make my career choices.
And then I want to finish on something that's relevant to every single one of us around the world: How have we applied computing, internet technologies, and analytics to address COVID-19? And we're just at the start of COVID-19, so we have a long way to go.
So we’re going to have to go way back in the way-back machine to the 1980s. So the top song of 1982 is “Physical” by Olivia Newton-John, which you may or may not have ever heard, but I’m sure you have seen the movie E.T. It was the top-grossing movie in 1982. And I’m living somewhere in a town in Michigan, and I’m a dancer. I actually take ballet as well as highland dancing. I am a total goof-off. I am probably in and out of school more than most. I’m a DJ at our local high school radio station.
My name is Susie at that time. I’m the homecoming queen. And I’m a total closet geek. Nobody at my high school knows that I’m an avid reader of science fiction. I’m reading Scientific American, which was about at my level in high school. I’m taking classes at the local community college—mostly in genetics and chemistry. I’m really fascinated with science, but that was my secret life.
And here’s what computing looked like to me in 1980. So popular then was the Commodore 64. I came from a community and a town where computers were not very common, so even my high school did not have any computers, but the local community college did. This is a typical computer science room that I never got to visit when I was in high school, but I’m pretty familiar with these.
And if you have never seen these before, these are punch cards. And so when you write a computer program in the 1980s and before, you have to translate it into the punch card system, and then you feed those punch cards into a machine that's not really quite visible in my picture.
And the worry of every computer scientist at that time was dropping those punch cards, because they're a program and they're in order, and if you drop those punch cards, you will spend a significant amount of time and worry trying to get them back in order. Just imagine trying to debug your program using a punch card system. It was so hard to do, so much work.
And when I was in high school, this was one of the computational biology highlights of 1982. By the way, 1982 is the year that I graduated from high school. This is the story of the protein dynamics of a small little tiny protein called BPTI, bovine pancreatic trypsin inhibitor. It’s approximately 60 amino acids long, and you can see its ribbon structure on the screen.
Wilfred van Gunsteren and Martin Karplus ran molecular dynamics trajectories of this little tiny protein for 25 picoseconds in vacuum, then put it in a spherical shell of 2,647 non-polar waters, and then fixed it in its crystal structure, and they tried to understand the dynamics of movement of this protein in these three scenarios. That particular paper and that particular simulation were a tour de force of computational biology the year I graduated from high school.
And I was totally amazed that we could actually do calculations of protein dynamics in these three different scenarios: in vacuum, in non-polar solvent, and in the crystal. So moving a little bit forward to the later 1980s, the top song is "Walk Like an Egyptian." That was the song when I was in college as an undergraduate. The top-grossing movie of 1987 is Three Men and a Baby, actually a movie I never saw (it's not quite my interest), and I am at the University of Michigan as an undergraduate. And I graduate in the year 1987.
I’m a chemistry major and a math major. It’s not uncommon probably for most of you to have dual majors. I am a research assistant. I am a research assistant in mathematics. I’m a research assistant in geology. And I’m also a research assistant in the medical school, where I am developing hepatic imaging agents through synthetic organic chemistry, and that is not my strength.
I do not do any more synthetic organic chemistry after undergraduate, but at that time I thought that would be an interesting type of research to explore. I’m also spending lots and lots of time looking for errors in my code.
I want to just make one point to you, as many of you are undergraduates. One of the most valuable experiences that you can gain as an undergraduate is to work in a lab. Working in a lab, with graduate students, with postdoctoral researchers, and with your PI mentor, will allow you to see what research is really like. You know, it's hard.
Sometimes you spend a lot of time working on a project and it doesn't go anywhere. There are a lot of false starts. This is one of the most valuable real-life learning experiences that you can have, and I encourage everybody to take at least one semester and do research in a laboratory. And, obviously, I am no longer a closet geek; I am an actual geek at the University of Michigan.
I am known mostly in the chemistry and math departments, but I do a lot of work in coding as well. And what do computers look like for me when I'm an undergraduate? This is one of the computers that I worked on. It's not my actual computer because I didn't take that with me. This is an IBM PS/2, and you can see that you can actually play chess on this computer.
This is The Ohio State University, a big competitor to Michigan, by the way. This is its supercomputing center in 1987. They are a powerhouse of supercomputing. They are not the only ones, but I knew them well. And this is the birth of new programming languages. You've probably heard of Fortran; that was my primary language when I was coding in the late '80s. C++, certainly, but Perl and these more interpretive and dynamic languages really start developing in the late '80s. What's the computational highlight from the year I graduated from college, which was 1987? It is another computer simulation.
This is the diffusion of a substrate into the active site of an enzyme. And this particular system is superoxide dismutase. And what I wanted to show you is that, unlike the last simulation, which followed the dynamics in a trajectory sort of way, these are more stochastic Brownian dynamics simulations, and what was really super cool about Kim Sharp and Barry Honig and Robert Fine's work is that they actually put the charges in the active site of the enzyme into the calculation.
And having the ability to have molecules have a charge gives you an electrostatic [unintelligible] for what’s really happening in that active site. And to me this was just a super cool simulation. I love the work of Barry Honig. I’ve followed him for years, and I have watched the field of electrostatic calculations go from point charges to probability charges to all sorts of really innovative work, and so I just wanted to share with you that one particular highlight.
Moving to a new decade—1990s. The top song in 1995 is “Gangsta’s Paradise” by Coolio, featuring L.V., and the top movie, which I did see, is Batman Forever. All those Batman movies are so great. And I am at the University of Maryland. And, obviously, I have never left this area. I am still living in Silver Spring today.
I am defending my PhD thesis in 1995. Just a side note. I took two years off between my undergraduate and my graduate studies, and I worked at the Naval Research Laboratory, where I was involved in the physical characterization of organic molecules used for blood surrogates. And it was a really wonderful experience because I got to see what it was like to work in a very large team at the Naval Research Laboratory, and I got to become much more proficient at NMR spectroscopy and IR spectroscopy and Raman spectroscopy, and I so loved Raman and IR spectroscopy that you’ll see it popping up in my future.
You see this character here on the giant steps. That’s my PhD thesis advisor. That’s Millard Alexander. He’s still at the University of Maryland. I think he might be emeritus at this point. But what did we work on?
So I studied flux in reactive systems, systems like boron hydride, and I studied what happens in those systems when the potential energy surfaces that describe different excitation states cross, and how you actually calculate curve crossings and reactions. That's really the story of flux.
I developed a new genetic algorithm, which is a pretty cool kind of algorithm (sketched generically below), for optimization of structures that have multiple potential energy surfaces, PESs, and obviously I'm not in computational biology yet. I am a serious homebrewer, and I got married to my colleague in physics.
And this is a later picture, but that is myself and my husband, Nicholas Phillips. When I was a graduate student, I wanted to change careers. I wanted to think differently about computation and what we can do with our careers, so I changed from physics to computational biology.
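The talk doesn't show the algorithm itself, so here is a purely generic illustration of what a genetic algorithm for structure optimization does. This is a sketch with an invented toy potential and made-up parameters, not Dr. Gregurick's actual method or code: a population of candidate "structures" (here just one-dimensional coordinates) is evolved by selection, crossover, and mutation toward low potential energy.

```python
# Generic genetic-algorithm skeleton for energy minimization (illustrative only).
import random

def energy(x):
    # Toy double-well potential standing in for a real potential energy surface.
    return (x**2 - 1.0)**2 + 0.1 * x

def evolve(pop_size=40, generations=200, mutation=0.1):
    # Start from random candidate "structures".
    pop = [random.uniform(-2.0, 2.0) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=energy)                      # rank by fitness (lower energy is better)
        parents = pop[: pop_size // 2]            # selection: keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = 0.5 * (a + b)                 # crossover: blend two parents
            child += random.gauss(0.0, mutation)  # mutation: small random perturbation
            children.append(child)
        pop = parents + children
    return min(pop, key=energy)

if __name__ == "__main__":
    best = evolve()
    print(f"best x = {best:.3f}, energy = {energy(best):.4f}")
```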
Here’s what computers looked like in the 1990s. This is actually a computer that I did most of my PhD work on. It’s an Apple Macintosh. I was so lucky to watch the birth of Mozilla, Netscape, and a little blurry for you is the HTML language that most of you probably know how to program in and you’re very, very efficient in. But when I was in grad school this was completely new—and so was this.
At one point, a list came out every day of the new websites that had appeared, a list of the top websites that had come out that day. And the first webcam, that's the coffee pot at Cambridge, where I actually visited and did some work as a grad student. There it is.
You could see the level of the coffee pot at any particular time and you would know when you could go down and get some new coffee. And here I’m going to play for you in the way-back machine the sound that I will never forget [dial-up handshake]. There it is. And that horrible sound goes on and on. That is how we had to connect to the internet. That is my dial-up modem.
So I had to sit at home and timeshare the one computer in our grad school house, dial up to the internet, and do our work. And most of us actually played games, and we had to have a lot of time in order to do our work and play games. So you guys have such a wonderful experience—always connected, always on—but for us, that was the sound that we heard hour after hour throughout the night.
Here’s something that was super exciting when I was early in my graduate school days. This is BLAST—Basic Local Alignment Search Tool—developed by a number of colleagues, including David Lipman. David Lipman is still at NCBI and NLM here at NIH.
This was a new approach for rapidly comparing sequences by doing a basic local alignment. And you would get a score, and that would tell you, for example, where the gaps were, where the insertions were. This particular algorithm has revolutionized the way we do comparative genomics, and now there are variants like PSI-BLAST, and there's just so much work that's happened since. I bet most of you have used BLAST or one of its descendants in your own research, and it was just remarkable.
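BLAST itself relies on fast word-seeding heuristics rather than a full dynamic program, but the local-alignment scoring idea it builds on (rewarding matches, penalizing mismatches and gaps) can be sketched with a minimal Smith-Waterman scorer in plain Python. This is an illustrative sketch, not NCBI's implementation, and the scoring parameters are arbitrary.

```python
# Minimal Smith-Waterman local-alignment scorer (illustrative sketch).
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best local alignment score ending at a[:i] and b[:j]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b (deletion)
            left = score[i][j - 1] + gap    # gap in a (insertion)
            score[i][j] = max(0, diag, up, left)  # 0 lets a fresh local alignment start
            best = max(best, score[i][j])
    return best

if __name__ == "__main__":
    # Prints the best local score for this pair of toy sequences.
    print(local_alignment_score("ACACACTA", "AGCACACA"))
```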
And this is really one of the reasons that I got inspired to think about bioinformatics and data science, because I started to realize when I was in physics that the world of data and the world of biology and the world of computing were the next big thing, and I think that you might agree that that’s actually true.
In the years since 1995, I have traveled to Israel for a postdoctoral fellowship in computational biology. I was a professor of computational biology at the University of Maryland, Baltimore County, for a number of years, and one of the projects that I worked on was this super large protein complex called GroEL-GroES. It's a protein chaperone complex, and it's huge: 14 subunits, though you can't see them all here, all together as one big complex, and each subunit is 58 kilodaltons. I couldn't even load that complex into memory on my computer when I was working on it. I had to do very large parallel processing on supercomputers just to do the calculations for how the GroEL-GroES chaperone complex and the proteins inside it, shown in blue, actually work.
I switched. I became a program director at the Department of Energy, and I focused fully and totally on data, data platforms, and data computing for energy and the environment, very particularly on bioenergy: translating poplar and other types of soft woody plants into bioenergy. I decided to make that career change because I wanted to have a bigger impact on a broader range of science, and I truly, truly am dedicated to data science.
I was a division director at NIGMS and I worked for Dr. Jon Lorsch, and I was the director of Biophysics, Biomedical Technology, and Computational Biosciences, and I really wanted to think about how we can change the landscape for technology, incorporating much more new and innovative technology as well as new ideas for team science.
And now I am the associate director for data science, where I am working across NIH and across the community to make data, data resources, findable, accessible, interoperable, and reusable. And I also am the mother of two fantastic young adults, Andrew Phillips, who is a junior in college studying, of all things, organic chemistry, and my daughter, Abigail Phillips, who is finishing high school and hoping someday to have a career in dance.
And I still brew beer. Almost every month I have another five-gallon carboy of beer brewing.
And here we are today.
You have data at your fingertips, and you have wonderful platforms to access and use that data. You’re always connected and you’re always on, and that’s a wonderful thing. And maybe it’s a curse too, but it’s so nice to never have to listen to that dial-up sound. You have supercomputers like we’ve never seen before that can really address problems of great complexity.
The GroEL-GroES chaperone complex problem I showed you could easily be handled today with massively parallel computing, without any special workarounds. You have R and its Shiny framework, and you can write codelets that you can map right onto bare metal with Kubernetes.
And you can package up your code into Docker containers and move it around to different cloud resources. And you're working in a community. You have GitHub. You share your software.
This is just an example of Jupyter, but there's such a great software-sharing community that's available to you. So how can we use all these tools that we have at our hands today to address a pandemic of this significance? How can we partner with industry for workflows and tools and analysis? And how can we provide you the resources so that you can get your work done?
I want to just give you three or four use cases of what we’re doing right now at NIH that you can use to study COVID-19. And this is an amazing story of two intramural researchers—one of them from NIDDK and the other from NCI—so NCI is National Cancer Institute, and NIDDK is National Institute of Diabetes and Digestive and Kidney Diseases.
And they, in three weeks, collected specimens from pathology, created the digital images of those specimens, de-identified them, partnered with a company called HALO, and put those whole-slide images up for you to use for reference so that you can study and understand COVID-19.
Right now we have many more than eight reference cases, because our two intramural researchers are getting more and more samples every day from hospitals in different countries. I think we're up to 19 reference cases, but there are more coming in every day.
And we're going to integrate this particular resource into a much larger resource in the near future, but right now you can go and do some limited artificial intelligence algorithm development on these resources. And we're partnering with a gaming and video card company to create processing workflows for CT images. CT has been one of the types of imaging that you can use to detect COVID-19 in patients, and so we're developing those workflows by using and leveraging gaming computers. This is a very nice artificial intelligence classification.
And we are providing high-performance computing resources to the federal government, to industry, to academic leaders around the world so that you can use resources from the national labs, resources from IBM, from Google, from AWS. Over 4 million CPU cores are available. The consortium is taking applications every day. So if you have an idea that you think would benefit from high-performance computing, this consortium is there. The resources are free for you.
We've come a long way since those days of punch cards and 25-picosecond dynamics simulations of tiny, tiny proteins, and I'm just wondering where you, our new and brightest generation of scientists, will take us in the future.
And with that, I would love to hear from you your questions, your comments, and your thoughts. And I’m going to turn it back over to Ming.
Thank you. Thank you so much, Susan. I will say that with computers, beer, and a lovely family, that's a very exciting life. So as I mentioned earlier, we are going to have questions. I will ask the first one on behalf of our audience: For a biology major interested in a research career, what would be the key computational and data science training or skills that the student should pick up while he or she is in school?
That is a great question, Ming. I would say that there are a few common ways in which biologists are coming to look at data and studies, and you can start to take classes in them now. That would include getting familiar with the programming language R, because quite a few software tools are written for and in R. But if that's a bit of a barrier, there are also tools such as Galaxy, which are a little bit more plug and play, and so using the tools available in Galaxy or Jupyter, you can access a lot of different types of computational software, like BLAST and others.
So getting familiar with those platforms, learning to use those tools, and understanding what the results mean for your research would be a great step forward. And Coursera offers many different types of computational classes for students.
And I think NIH has offerings to make Coursera computational data science classes freely available for NIH students, so we would be more than happy to point you to those resources.
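As a purely hypothetical illustration of the kind of first analysis a student might run in a Jupyter notebook, here is a short sketch. The file name and column names are invented for illustration, and it assumes the pandas and matplotlib packages are installed.

```python
# A minimal, hypothetical first data analysis in Python (illustrative only).
import pandas as pd
import matplotlib.pyplot as plt

# Load a (hypothetical) table of gene expression measurements.
df = pd.read_csv("expression_counts.csv")   # assumed columns: gene, sample, counts

# Summarize: mean counts per gene across samples, top 10 shown.
summary = df.groupby("gene")["counts"].mean().sort_values(ascending=False)
print(summary.head(10))

# A quick look at the distribution of counts.
df["counts"].hist(bins=50)
plt.xlabel("counts")
plt.ylabel("number of measurements")
plt.show()
```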
Great, great. There is actually a question from one of the students. Where would I go to apply for access to computer resources?
From the HPC Consortium. There is a website, and the application is processed through NSF, through a program called XSEDE. NSF will route your application to the consortium, and the proposal is very lightweight. It’s only, I believe, three pages, so you can certainly easily apply for those resources, and then they will match the resource needs to the application that you put in, so you can have access to many different types of resources.
Related to that, NIH has training opportunities and resources available as well, right?
Absolutely. There are a number of different training opportunities that I did prepare as an extra slide, including our SRA metadata cloud, BigQuery, and NIAID bioinformatics training resources. All of the resources that I told you about today can be found on our website, including the high-performance computing application.
And then there are a number of training opportunities that we will be having available, including, if you really want to do computing on bare metal, a Kubernetes engine two-day course coming up later this month. There are a number of other opportunities in the works, working with either Google Cloud Platform (GCP) or AWS, and some new opportunities for machine learning as well as data engineering later in July.
Great. Another question is more about your own scientific journey. How did you decide to change your field, and how did you update yourself with the new field?
That is a great question. And it's sort of a funny story. I was studying physics, mostly surface and gas-phase physics. And the funding was starting to change when I was a grad student, from that physics/Silicon Valley type of funding much more into bioinformatics, and my PhD thesis advisor said, "There are a few opportunities in your life when you can do a career change, and from graduate school to postdoc is one of them. If you want to make a change to computational biology," because he saw all the articles I was reading, "now is the time you need to do it."
So I wrote to a number of people to get training specifically from people who were prior physicists who had moved to computational biology, and that is how I chose my postdoc: by working with somebody who had also been a physicist, so that we would have some common language. It was a hard change.
I had taken very, very few biology classes when I was an undergrad, and obviously no biology classes when I was a grad student, so I had a huge lift to retrain myself. I was lucky that my postdoctoral advisor was very patient with me as I did have to take additional training and coursework, in biology in particular.
And I will be the first to admit that I do not have the strength and background in biology that many of my colleagues at NIGMS have, and I often have to look to them to understand the meaning of the systems that I'm trying to study in much more detail than my own background allows.
Biology is so complicated, but it’s also so fascinating.
Great. This follow-up question is from a different angle. Do you have advice for postdocs not classically trained in data computational science wanting to transition into the field?
Yes, absolutely. I would take a similar approach: working with an advisor, or doing a one-year sabbatical as a junior assistant professor with a colleague who has that training in wet-lab experimentation but has also made a transition to computation, will help you a lot.
So you might need to take an additional year of postdoc or sabbatical to train in the computational sciences, but working hands-on in the lab with other people in the computational field will give you a lot of insight. I also took apart a lot of code to learn how it worked, and that is a good way to learn how something works: take it apart and then try to put it back together again.
OK, this is a closely related one. What computational bioinformatics opportunities are there for a prospective postdoc at NIH?
There are a number of computational fellowships that one can apply for. There’s also a lot of funding for new investigators in computational data science, and you happen to be looking at the institute that has, I’d say, the largest amount of computational and data science funding opportunities, NIGMS, and so working with them to get funding in one of their programs is absolutely a wonderful opportunity.
Another one related to this: What level of math and statistics would you need to be able to take advantage of the bioinformatics tools you mentioned earlier?
I would say that having a good basic understanding of mathematics and statistics will always help you. In fact, when I was looking at majors when I was in college, I was thinking of double majoring in computer science and one of my colleagues told me that it’s much better to major in math because math is the foundation of most computer science. And that’s true, I see that now.
So having a strong mathematics background can never do you wrong. But if there’s a little barrier, then having a good foundation for statistics will definitely be a very important tool to have in your toolbox.
Another one: What programming language would be suitable for understanding computational biology?
I have so many favorites, but they're probably a little old and outdated now. What I see is that people find R and R Shiny to be very useful, and many of our PIs are writing their programs in R. So if I had to pick one, it would be R. But if you ask me what my favorite programming language is, it's actually Perl. I loved Perl so much. I did not like Java very much, and I certainly didn't like many of the threaded languages, but I just absolutely loved Perl. I don't think that's very useful anymore, though. I think R is probably going to be your best bet.
Great. What would be your advice for gaining computational skills that you want to incorporate into your research, rather than entering the field as a whole? And what would be the best way for an undergrad to approach a potential mentor?
So you want to approach a mentor and gain experience? I’m trying to understand how to parse that.
The first part is: Are there ways to gain computational skills that you want to incorporate into research, without really wanting to become a card-carrying data scientist?
I would say learning some of the more popular software tools, BLAST for example, is a great start. Just learning how to use a tool and what its results mean for your own research would probably do you very well. So you would never have to write any code, or much code at all, using existing software, but it will really help you if you know the basics, know the results, and know the foundation of some of those more popular tools.
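For those who want to go one small step beyond the web interface, here is a minimal sketch of running BLAST from Python using Biopython's classic NCBI web interface and reading the scored hits. It assumes the biopython package is installed and a network connection is available, and the query sequence below is only a placeholder.

```python
# Run a protein BLAST search via NCBI and print a few hits (illustrative sketch).
from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder protein fragment

# Submit a blastp search against the non-redundant (nr) protein database.
result_handle = NCBIWWW.qblast("blastp", "nr", query)
record = NCBIXML.read(result_handle)

# Each alignment carries one or more scored local alignments (HSPs)
# with gaps, identities, and an E-value.
for alignment in record.alignments[:5]:
    for hsp in alignment.hsps:
        print(alignment.title[:60])
        print(f"  score={hsp.score}, E-value={hsp.expect}, "
              f"identities={hsp.identities}/{hsp.align_length}")
```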
OK, here is a specific one: Which tools would you recommend for cryo-EM image processing to determine protein structures?
I am not an expert in that, but I think there are some tools; something like Cryolan is one tool that I've heard of, and I believe that's been ported to the cloud, and actually I believe that NVIDIA worked on that as well for cryo-EM. There are probably other more popular and better tools; that's just one that I know about because of the partnership with NVIDIA.
OK, there is also a pretty specific question: Does NIH have open resources or services for sequencing samples from patients?
Absolutely. And this may or may not be available to the open community, but our institute NHGRI, the National Human Genome Research Institute, does do sequencing on patients, particularly right now for COVID-19. We also have a national lab in Frederick, Maryland, the Frederick National Laboratory, which is doing sequencing on COVID-19 patients, as well as developing serology testing and analyzing that data. I see that CryoSPARC is another popular cryo-EM data processing tool.
Thank you so much. That must be Mary Ann Wu who has mentioned that. So thank you very much. CryoSPARC is coming up as another popular tool.
So let's go for another question. Given that sectors such as banking and insurance often offer much higher salaries to a student with that kind of computational data science training, what would you tell those students so that they would consider biomedical research as a rewarding and viable career choice?
That’s a great question because it’s always on my mind as well. I would say that being an investigator and a researcher in computational biology and studying and understanding biology is rewarding for a number of reasons.
The flexibility that you have in your career and your career choices, and the types of work that you do, those are up to you. You make the decisions and you are the captain of your ship, and you make the contributions to science, unlike in private industry, where the captain of the ship is the CEO and the board of directors; they make a lot of the decisions and you are the one implementing them.
Here, when you are a researcher in an academic setting, you are the one who is discovering and pushing the field forward. And if that passion for understanding, addressing questions, and using your skills in computer science or in the wet lab drives you, you will stay up day and night to do it. You will find that the passion you have for research will not be quenched by whatever extra money you might have given up by not moving to industry.
What are some of the big issues you are working on as the NIH associate director for data science?
Right now the biggest issue we’re working on is with respect to COVID-19, and that is that we have to very rapidly create and move an infrastructure to get the data and the information to scientists in such a way that they can use their algorithms to answer really important questions.
Data science requires data, but it requires data to be well formatted, to be well curated, to be annotated, to be in a common model so that we can look across many different organizations, and that’s what we’re working on right now. And we’re spending all of our days, most of our nights, and even our weekends—and not just me, many people at NIH—to move the data into a way that researchers can use it right now.
Which language do you think is best to start learning if someone does not have any prior knowledge of a programming language?
I think the best one to begin with is still probably working in R. I learned Fortran—they don’t even teach that anymore—in college. C++ underlies many of the programming languages that are used, so that’s always a good language to learn, especially if you want to be a heavy-hitting computer science person.
But if you’re looking to pick something up and be pretty proficient quickly, I do recommend looking at R.
Is there a specific platform that is better to take computer science courses online, like Coursera or Udemy? I’m sorry if I botched the names. I’m not familiar with them. Is one better than the other?
I’m much more familiar with Coursera, and we have developed a partnership with them so that we can provide training for a large number of colleagues, so that is the one that I personally know the best and would recommend, but there probably are others. My son is very fond of Khan Academy, and he’s been taking a lot of courses, even when he was in high school, through Khan Academy.
Here’s one question that requires some physician training, Susan. With the transition from in-person to online, what would you recommend for preventing your eyes from tiring due to staring at screens for a long time?
I don’t know if I’m qualified to say or not, but my strategy is to take lots of micro breaks, because I can certainly understand what you’re saying in terms of eye strain. And also sitting down all day is not so good either, so my personal recommendation, and I’m not a physician at all, I’m a computer person, I like to take micro breaks.
I think you have a brewer to take care of, right?
I do, yes.
Does NIH work with the big tech companies?
Indeed. Yes, we do. We have partnerships through our STRIDES program with Google and AWS. We partner with Palantir, which is a very large analytics platform. We partner with NVIDIA, which is a company that develops gaming chips. We partner with smaller companies. I don't know if HALO is super small, but that's the platform that we put the website up on. So we do partner with a number of tech companies.
We've talked to a number of folks who are in the AI space to look at partnerships. We partner with the national labs and with other agencies, such as NSF. We're looking to partner with sister agencies such as the VA. That's how science moves forward: working together. Each partnership offers strengths, and we have a strength too. We don't duplicate each other's work; we partner, and together we move science forward.
Do you recommend any data science bootcamps for more structured training?
I have to say that I have a colleague in my office, her name is Allissa Dillman, and she runs a number of codeathons and bootcamps, and so I would love to encourage you to take one of her bootcamps. In order to see which one is running, you have to go to my website, and I just now see that we did not put it up there. But if you go to the NIH data science website, datascience.nih.gov, you'll be able to find the bootcamps that we're running.
I’ve done a number of jamborees and bootcamps in my past, and I’ve always loved the ones that focused on writing analytic tools for sequence analysis and metabolic pathway analysis. Those are my personal favorites, but she runs bootcamps on sequence analysis; she runs bootcamps on understanding electronic healthcare record data. She runs so many very different types of bootcamps. But I would say that attending one of her bootcamps would probably be a lot of fun. She’s young and much more in tune with where computer science is going than me. I haven’t coded in more than, I don’t know, 10 years now, I think.
All right. I’m interested in learning Python. Do you have any advice on how I should learn?
I only have some experience working in R.
Yeah, Python. I can just tell you my strategy for how I learned, which was to get code, take it apart, and then work with it. Put in new subroutines, new algorithms, and see if I could get it to do something new. That's how I helped my son learn programming, so I would suggest, if you're interested in Python, getting some code written in Python from GitHub and seeing if you can play around with it.
There are great books by O'Reilly on understanding computer code at a somewhat easier level, and I would also get one of the O'Reilly books. The Python book is particularly fun. We have that at our house.
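In that spirit, here is a small, purely illustrative script (not from the talk) of the kind a beginner could take apart, modify, and extend, for example by adding a reverse-complement or translation routine.

```python
# A small, self-contained script a beginner could take apart and extend.

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def base_counts(seq):
    """Count of each base, e.g. {'A': 3, 'T': 2, ...}."""
    counts = {}
    for base in seq.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts

if __name__ == "__main__":
    dna = "ATGCGCGTTAACCGGAT"
    print(f"GC content: {gc_content(dna):.2f}")
    print(f"Base counts: {base_counts(dna)}")
```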
I’m interested in bioinformatics with a biology background. I don’t have any physics background. If I want to know more about physics, where should I start?
That’s a great question. There are a lot of primers that you can get to understand some of the underlying physics behind the bioinformatics. Sometimes it’s just helpful to take a paper that you’re interested in and read some of the references or some of the underlying methodology.
So you find a paper that you’re interested in and you see some methodology, then go back to your textbooks and learn a little bit more from the methodology that’s in the paper. Or you could always take a class in physics, although they tend to be not completely relevant to the paper that you’re reading. So that would be my suggestion.
OK, here is one that is more current. What type of information is available in association with NIH COVID-19 samples? For example, is there specific phenotypic information, like GI or cardiovascular symptoms and their severity? Or medications that patients were on prior to infection, such as ACE inhibitors? Is there proteomic or RNA sequence data associated with the histological samples you mentioned?
That is a great question, because COVID-19 is such a hydra of a disease; it's been hard for us to get our hands around it. We're looking at making some of the very basic underlying electronic healthcare record data available, data that will tell you about medications and about prior conditions, but in a de-identified way, so that you wouldn't be able to trace it back to a particular individual. You could, however, look at correlations between what a patient with COVID-19 presents with when they enter the hospital and the drugs they have taken or their prior conditions.
In terms of proteomics and sequences, we have much less data on that. It’s hard to get those data. The healthcare system tends to be a little bit taxed, and so right now getting proteomic samples has been more challenging and we are just now getting sequencing samples from COVID-19 patients.
Putting all that information together is our grand challenge at this point. We think we can make some of the data available. As you can tell, it’s coming in a staged way because we have the pathology images available right now.
We don’t even have the CT images available for researchers. They are in the queue. They need to be de-identified. They need to be associated with the appropriate standards and metadata so that you can use them. So even getting those CT images is taking a long time.
Getting the other data like electronic healthcare record data de-identified, we hope we can get that done by this summer, but it’s going to take some time. And the sequencing data, that might be even longer, so you can see the struggles that we have just to get the data out for researchers to use.
Great. What do you think of the current, state-of-the-art research on protein structure prediction?
I do have a favorite, and I've been involved in protein structure determination and prediction for a while. In terms of determining structures, certainly X-ray scattering was a popular way to determine structure for many, many years. I certainly worked in X-ray structure as well as neutron scattering, which is not as refined as X-ray.
Now we see cryo-EM blossoming into a really serious research tool for actual atom-specific structures. Then, on the protein structure prediction side, there was, I don't know if you're familiar with it, the CASP, Critical Assessment of Structure Prediction, competition that was run every two years. I don't know what number we're up to now, but when I was working on it, people were doing homology modeling: taking a standard and trying to align an unknown sequence to that standard.
They were working on threading. I did a lot of threading. I did a lot of genetic algorithm protein structure predictions, and some molecular dynamics. And then there was the work by David Baker, which looked at little tiny windows of protein and mapped them onto existing structures. And that approach seems to have been quite successful.
I think the field is still moving in that direction of micro-threading. I cannot believe I forgot Rosetta. Rosetta, that was his program. I think the field really pushed forward with his revolutionary work in Rosetta, and now I imagine what’s happening is much more looking at artificial intelligence to gain information about higher structures to even move further into what those new structures might be.
So in protein structure prediction, I think the door really opened up with David Baker's work, but prior to that there were an awful lot of BLAST-type algorithms.
As we move closer to the end of the hour, the questions are getting more futuristic. Here is one. Do you think physically writing code will be less important in 5 to 10 years, when you can use platforms like Galaxy for basic and translational biological research?
Actually, I think you're kind of right. I think that people are producing codelets, little micro bits of code that can be swapped in and out in a modular way. And so my old way of working, taking a giant code like CHARMM, which is fairly huge, and trying to add subroutines to it, will change to coding with codelets, where you just swap out little bits.
So that’s the idea of Galaxy, and platform-based coding is probably going to be much more standard for many, many folks in the future. I think that computer science is moving in really interesting and fun directions, and I look forward to watching what you guys do.
Good. Here’s a question. Are there any online training courses that include the biophysics branch of bioinformatics?
I would think so, but off the top of my head I don't have those online courses. I do know that through NIGMS we have funded a number of big data online training courses, and through the societies there are definitely training courses, so the Biophysical Society would be a great place to look for online training courses in biophysics.
Are there more questions? I’ll wait a little bit. Going once…going twice…three times.
Thank you so much, Susan. This was a fantastic hour. I hope everybody enjoyed it.
Thank you so much. It's been a real pleasure to tell you about my personal journey in data science and computational science and where we are now with COVID-19. And I hope that you will take the opportunity to look at the online training resources that are available, and also look at our website, and do participate in any of the training opportunities offered through our STRIDES partnerships with AWS and Google, or through our NCBI courses and webinars, and through the NIAID bioinformatics training resources.
All right, thank you all. Stay safe and be well.
Thank you.