NVIDIA is collaborating with biopharmaceutical company AstraZeneca and the University of Florida’s academic health center, UF Health, on new AI research projects using breakthrough transformer neural networks.
Transformer-based neural network architectures — which have become available only in the last several years — allow researchers to leverage massive datasets using self-supervised training methods, avoiding the need for manually labeled examples during pre-training. These models, equally adept at learning the syntactic rules to describe chemistry as they are at learning the grammar of languages, are finding applications across research domains and modalities.
NVIDIA is collaborating with AstraZeneca on a transformer-based generative AI model for chemical structures used in drug discovery. It will be among the first projects to run on Cambridge-1, which is soon to go online as the U.K.’s largest supercomputer. The model will be open sourced, available to researchers and developers in the NVIDIA NGC software catalog, and deployable in the NVIDIA Clara Discovery platform for computational drug discovery.
Separately, UF Health is harnessing NVIDIA’s state-of-the-art Megatron framework and BioMegatron pre-trained model — available on NGC — to develop GatorTron, the largest clinical language model to date.
New NGC applications include AtacWorks, a deep learning model that identifies accessible regions of DNA, and MELD, a tool for inferring the structure of biomolecules from sparse, ambiguous or noisy data.
Megatron Model for Molecular Insights
The MegaMolBART drug discovery model being developed by NVIDIA and AstraZeneca is slated for use in reaction prediction, molecular optimization and de novo molecular generation. It’s based on AstraZeneca’s MolBART transformer model and is being trained on the ZINC chemical compound database — using NVIDIA’s Megatron framework to enable massively scaled-out training on supercomputing infrastructure.
The large ZINC database allows researchers to pretrain a model that understands chemical structure, bypassing the need for hand-labeled data. Armed with a statistical understanding of chemistry, the model will be specialized for a number of downstream tasks, including predicting how chemicals will react with each other and generating new molecular structures.
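To make the idea of label-free pretraining concrete, here is a minimal sketch of how a training pair can be built from nothing but a molecule's text representation. It assumes molecules are written as SMILES strings and uses simple token masking as the corruption step. The regular expression, the `<mask>` token, the 15 percent masking rate and the function names are illustrative assumptions, not the tokenizer or training code used for MolBART or MegaMolBART.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, then any
# single character. Illustrative only; not the tokenizer used by MolBART.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

MASK = "<mask>"  # hypothetical mask token


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def make_denoising_pair(smiles, mask_prob=0.15, rng=None):
    """Return (corrupted, target) token lists for one molecule.

    The target is the original token sequence, so the molecule itself
    supplies the training signal and no hand-written label is required.
    """
    rng = rng or random.Random()
    target = tokenize(smiles)
    corrupted = [MASK if rng.random() < mask_prob else tok for tok in target]
    return corrupted, target


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    corrupted, target = make_denoising_pair(aspirin, rng=random.Random(0))
    print("input :", " ".join(corrupted))
    print("target:", " ".join(target))
```

A model trained to recover the target from the corrupted input has to pick up regularities such as ring closures and valid branching, which is the kind of statistical understanding of chemistry the paragraph above describes.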
“Just as AI language models can learn the relationships between words in a sentence, our aim is that neural networks trained on molecular structure data will be able to learn the relationships between atoms in real-world molecules,” said Ola Engkvist, head of molecular AI, discovery sciences, and R&D at AstraZeneca. “Once developed, this NLP model will be open source, giving the scientific community a powerful tool for faster drug discovery.”
The model, trained using NVIDIA DGX SuperPOD, gives researchers ideas for molecules that don’t exist in databases but could be potential drug candidates. Computational methods, known as in-silico techniques, allow drug developers to search through more of the vast chemical space and optimize pharmacological properties before shifting to expensive and time-consuming lab testing.
This collaboration will use the NVIDIA DGX A100-powered Cambridge-1 and Selene supercomputers to run large workloads at scale. Cambridge-1 is the largest supercomputer in the U.K., ranking No. 3 on the Green500 and No. 29 on the TOP500 list of the world’s most powerful systems. NVIDIA’s Selene supercomputer topped the most recent Green500 and ranks fifth on the TOP500.
Language Models Speed Up Medical Innovation
UF Health’s GatorTron model — trained on records from more than 50 million interactions with 2 million patients — is a breakthrough that can help identify patients for lifesaving clinical trials, predict and alert health teams about life-threatening conditions, and provide clinical decision support to doctors.
“GatorTron leveraged over a decade of electronic medical records to develop a state-of-the-art model,” said Joseph Glover, provost at the University of Florida, which recently boosted its supercomputing facilities with NVIDIA DGX SuperPOD. “A tool of this scale will enable healthcare researchers to unlock insights and reveal previously inaccessible trends from clinical notes.”
Beyond clinical medicine, the model also accelerates drug discovery by making it easier to rapidly create patient cohorts for clinical trials and for studying the effect of a certain drug, treatment or vaccine.
It was created using BioMegatron, the largest biomedical transformer model ever trained, developed by NVIDIA’s applied deep learning research team using data from the PubMed corpus. BioMegatron is available on NGC through Clara NLP, a collection of NVIDIA Clara Discovery models pretrained on biomedical and clinical text.
“The GatorTron project is an exceptional example of the discoveries that happen when experts in academia and industry collaborate using leading-edge artificial intelligence and world-class computing resources,” said David R. Nelson, M.D., senior vice president for health affairs at UF and president of UF Health. “Our partnership with NVIDIA is crucial to UF emerging as a destination for artificial intelligence expertise and development.”
Powering Drug Discovery Platforms
NVIDIA Clara Discovery libraries and NVIDIA DGX systems have been adopted by computational drug discovery platforms, too, boosting pharmaceutical research.
- Schrödinger, a leader in chemical simulation software development, today announced a strategic partnership with NVIDIA that includes research in scientific computing and machine learning, optimization of Schrödinger applications on NVIDIA platforms, and a joint solution around NVIDIA DGX SuperPOD to evaluate billions of potential drug compounds within minutes.
- Biotechnology company Recursion has installed BioHive-1, a supercomputer based on the NVIDIA DGX SuperPOD reference architecture that, as of January, is estimated to rank at No. 58 on the TOP500 list of the world’s most powerful computer systems. BioHive-1 will allow Recursion to complete within a day deep learning projects that previously took a week on its existing cluster.
- Insilico Medicine, a partner in the NVIDIA Inception accelerator program, recently announced the discovery of a novel preclinical candidate to treat idiopathic pulmonary fibrosis — the first example of an AI-designed molecule for a new disease target nominated for clinical trials. Compounds were generated on a system powered by NVIDIA Tensor Core GPUs, taking less than 18 months and under $2 million from target hypothesis to preclinical candidate selection.
- Vyasa Analytics, a member of the NVIDIA Inception accelerator program, is using Clara NLP and NVIDIA DGX systems to give its users access to pretrained models for biomedical research. The company’s GPU-accelerated Vyasa Layar Data Fabric is powering solutions for multi-institutional cancer research, clinical trial analytics and biomedical data harmonization.
Learn more about NVIDIA’s work in healthcare at this week’s GPU Technology Conference, which kicks off with a keynote address by NVIDIA CEO Jensen Huang. Registration is free. The healthcare track includes 16 live webinars, 18 special events and over 100 recorded sessions.