Deep learning applications are data hungry. The more high-quality labeled data a developer feeds an AI model, the more accurate its inferences.
But creating robust datasets is the biggest obstacle for data scientists and developers building machine learning models, says Gaurav Gupta, CEO of TrainingData.io, a member of the NVIDIA Inception virtual accelerator program.
The startup has created a web platform to help researchers and companies manage their data labeling workflow and use AI-assisted segmentation tools to improve the quality of their training datasets.
“When the labels are accurate, then the AI models learn faster and they reach higher accuracy faster,” said Gupta.
The company’s web interface, which runs on NVIDIA T4 GPUs for inference in Google Cloud, helped one healthcare radiology customer speed up labeling by 10x and decrease its labeling error rate by more than 15 percent.
The Devil Is in the Details
The higher the data quality, the less data needed to achieve accurate results. A machine learning model can produce the same results after training on a million images with low-accuracy labels, Gupta says, or just 100,000 images with high-accuracy labels.
Getting data labeling right the first time is no easy task. Many developers outsource data labeling to companies or crowdsourced workers. It may take weeks to get back the annotated datasets, and the quality of the labels is often poor.
A rough annotated image of a car on the street, for example, may have a segmentation polygon around it that also includes part of the pavement, or doesn’t reach all the way to the roof of the car. Since neural networks parse images pixel by pixel, every mislabeled pixel makes the model less precise.
That margin of error is unacceptable for training a neural network that will eventually interact with people and objects in the real world — for example, identifying tumors from an MRI scan of the brain or controlling an autonomous vehicle.
Developers can manage their data labeling through TrainingData.io’s web interface, while administrators can assign image labeling tasks to annotators, view metrics about individual data labelers’ performance and review the actual image annotations.
Using AI to Train Better AI
When a data scientist first runs a machine learning model, it may only be 60 percent accurate. The developer then iterates several times to improve the performance of the neural network, each time adding new training data.
TrainingData.io is helping AI developers across industries use their early-stage machine learning models to ease the process of labeling new training data for future versions of the neural networks — a process known as active learning.
With this technique, the developer’s initial machine learning model can take the first pass at annotating the next set of training data. Instead of starting from scratch, annotators can just go through and tweak the AI-generated labels, saving valuable time and resources.
The startup offers active learning for data labeling across multiple industries. For healthcare data labeling, its platform integrates with the NVIDIA Clara Train SDK, allowing customers to use the software toolkit for AI-assisted segmentation of healthcare datasets.
Choose Your Own Annotation Adventure
TrainingData.io chose to deploy its platform on cloud-based GPUs to easily scale usage up and down based on customer demand. Researchers and companies using the tool can choose whether to use the interface online, connected to the cloud backend, or instead use a containerized application running on their own on-premises GPU system.
“It’s important for AI teams in healthcare to be able to protect patient information,” Gupta said. “Sometimes it’s necessary for them to manage the workflow of annotating data and training their machine learning models within the security of their private network. That’s why we provide Docker images to support on-premises annotation on local datasets.”
Balzano, a Swiss startup building deep learning models for radiologists, is using TrainingData.io’s platform linked to an on-premises server of NVIDIA V100 Tensor Core GPUs. To develop training datasets for its musculoskeletal orthopedics AI tools, the company labels a few hundred radiology images each month. Adopting TrainingData.io’s interface saved the company a year’s worth of engineering effort compared to building a similar solution from scratch.
“TrainingData.io’s features allow us to annotate and segment anatomical features of the knee and cartilage more efficiently,” said Stefan Voser, chief operating officer and product manager at Balzano, which is also an Inception program member. “As we ramp up the annotation process, this platform will allow us to leverage AI capabilities and ensure the segmented images are high quality.”
Balzano and TrainingData.io will showcase their latest demos in NVIDIA booth 10939 at the annual meeting of the Radiological Society of North America, Dec. 1-6 in Chicago.