Unlimited Data, Unlimited Possibilities: UF Health and NVIDIA Build World’s Largest Clinical Language Generator

Researchers plan to use SynGatorTron to develop better AI for rare disease research and clinical trials, as well as to reduce dataset bias.
by Anthony Costa

The University of Florida’s academic health center, UF Health, has teamed up with NVIDIA to develop a neural network that generates synthetic clinical data — a powerful resource that researchers can use to train other AI models in healthcare.

Trained on a decade of data representing more than 2 million patients, SynGatorTron is a language model that can create synthetic patient profiles that mimic the health records it has learned from. The 5-billion-parameter model is the largest language generator in healthcare.

“Synthetic data isn’t actually linked to a real human being, but it has similar characteristics to real patients,” said Dr. Duane Mitchell, an assistant vice president for research and director of the UF Clinical and Translational Science Institute. “SynGatorTron can, for example, create health records of digital diabetes patients that have features just like a real population.”

Using this synthetic data, researchers can create tools, models and tasks without the privacy risks that come with real patient records. These can then be used on real data to ask clinical questions, look for associations and even explore patient outcomes.

Working with synthetic data also makes it easier for different research institutions to collaborate and share models. And since the amount of data that can be synthesized is virtually limitless, researchers can use SynGatorTron-generated data to augment small datasets of rare disease patients or minority populations to reduce model bias.

SynGatorTron was developed using the open-source NVIDIA Megatron-LM and NeMo frameworks. It’s based on UF Health’s GatorTron model, announced last year at NVIDIA GTC. The models were trained on HiPerGator-AI, the university’s in-house NVIDIA DGX SuperPOD system, which ranks among the world’s top 30 supercomputers.

GatorTron-S, a BERT-style transformer model trained on synthetic data generated by SynGatorTron, will be available for developers next month on the NGC software hub.

SynGatorTron Opens Gate to Robust Training Data

To a doctor, an AI-generated doctor’s note can appear impractical at first glance — it doesn’t represent a real patient and won’t read as logical to an expert eye. So a clinician can’t make a direct analysis or diagnosis from it. But to an untrained AI, real and synthetic clinical data are both highly valuable.

“SynGatorTron’s generative capability is a great enabler of natural language processing for medicine,” said Dr. Mona Flores, global head of medical AI at NVIDIA. “Synthesizing different types of clinical records will democratize the ability to create all sorts of applications dependent on such data by addressing data sparsity and privacy.”

Once it’s available, research institutions outside UF Health could fine-tune the pretrained SynGatorTron model with their own localized data and apply it to their AI projects. For example, if a given condition or a patient population is underrepresented in a health system’s clinical data, SynGatorTron can be prompted to generate additional data with characteristics of that disease or population.

These AI-generated records could then be used to supplement and balance out real healthcare datasets used to train other neural networks, so that they better represent the population.
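The balancing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SynGatorTron's interface: the article does not describe how the model is called, so `placeholder_generator` is a hypothetical stand-in for whatever generative model produces the notes, and the group labels and record layout are invented for the example. The sketch counts how many records each group has, works out how many synthetic records each underrepresented group would need to match the largest group, and appends that many generated records.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A record is (group label, note text). This layout is an assumption
# made for the example, not a format taken from the article.
Record = Tuple[str, str]


def synthetic_quota(records: List[Record]) -> Dict[str, int]:
    """Return how many synthetic records each group needs
    to reach the size of the largest group."""
    counts = Counter(label for label, _ in records)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def balance(records: List[Record],
            generate: Callable[[str, int], str]) -> List[Record]:
    """Append generated records until every group matches the largest one."""
    augmented = list(records)
    for label, n_missing in synthetic_quota(records).items():
        for i in range(n_missing):
            augmented.append((label, generate(label, i)))
    return augmented


def placeholder_generator(label: str, index: int) -> str:
    """Hypothetical stand-in for a generative model prompted with a group."""
    return f"[synthetic note {index} for group: {label}]"


real = [
    ("common condition", "note A"),
    ("common condition", "note B"),
    ("common condition", "note C"),
    ("rare condition", "note D"),
]
balanced = balance(real, placeholder_generator)
print(Counter(label for label, _ in balanced))
```

In this toy dataset the rare condition has one record against three for the common one, so two synthetic records are added. Matching the largest group is one simple choice; a real project might instead target known population proportions.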

Since synthetic training datasets mimic real medical notes without being associated with specific patients, they can also be more readily shared across research institutions without raising privacy concerns.

“When you have the ability to mimic population characteristics without being tethered to real patients, it opens the imagination to see if we can generate realistic datasets that allow us to answer questions we couldn’t otherwise, due to constraints on access to data or limited information on patients of interest,” Mitchell said.

One potential application is in clinical trials, which often divide patients into treatment and control groups to measure the effectiveness of a new medication. An application derived from SynGatorTron-generated data could parse real records and create digital twins of those patient records. The digital twin records could then serve as the control group in a clinical trial, instead of a control group formed by giving real patients a placebo treatment.

Researchers developing a deep learning model to study a rare disease, or the effects of a treatment on a specific population, could also use SynGatorTron for data augmentation, generating more training data to supplement the limited amount of real medical records available.

Healthcare at GTC

Register free for GTC, running online March 21-24, to discover the latest in AI and healthcare. Hear from SynGatorTron collaborators in the session “A Next-Generation Clinical Language Model,” taking place March 23 at 7 a.m. Pacific.
