To better battle cancer, we need data. Lots of it.
With cancer so prevalent, data is abundant. There’s everything from medical records stuffed with the pathology reports of millions of cancer patients to newspaper archives filled with the obituaries of cancer victims.
All this information effectively creates a dispersed database that can be used to determine ties between demographics and population cancer outcomes. But it takes a lot of time to analyze so much unstructured text data. That’s why the Surveillance, Epidemiology, and End Results (SEER) Program of the U.S. National Cancer Institute typically reports its annual cancer statistics with a five-year delay.
To speed things up, researchers at the Oak Ridge National Laboratory’s Health Data Sciences Institute have combined GPUs, deep learning algorithms, and data analytics and extraction technologies with ORNL’s Titan supercomputer.
“The goal is to be able to tell as a nation if we’re making progress” in battling cancer, said Georgia Tourassi, director of the Health Data Sciences Institute.
Deep Learning Speeds Side-by-Side Projects
Tourassi’s team is tackling both pathology reports and obituaries in two separate projects intended to provide new insights into the patterns of cancer. The obituary project, now in its fourth year, has been fully funded by an NCI grant. Researchers have been developing analytical tools that automate the research and can thus support more comprehensive epidemiological studies.
In the latter stages of the project, Tourassi’s team has been using a practice known as data parallelism. In this technique, data is divided among separate computing nodes on Titan, allowing the same process to be applied to different data segments simultaneously. This is speeding up efforts to establish a deep learning network that will improve the data analysis and extraction efforts.
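The idea behind data parallelism can be sketched on a single machine. The example below is only an illustration, not the team's code: the sample texts, the function names, and the use of Python threads in place of Titan's computing nodes are all assumptions. Simple term counting stands in for the deep learning training step, since the pattern is the same: split the data, run the identical process on each segment, then combine the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(items, num_shards):
    """Divide items into num_shards roughly equal, contiguous segments."""
    size, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def count_terms(shard):
    """The 'same process' every worker runs: count lowercase tokens in its shard."""
    counts = Counter()
    for text in shard:
        counts.update(text.lower().split())
    return counts


def data_parallel_count(texts, num_workers=4):
    """Run count_terms on each shard concurrently, then merge the partial results."""
    shards = split_into_shards(texts, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_terms, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up sample texts (hypothetical), standing in for real documents.
sample_texts = [
    "lung cancer diagnosed in 2010",
    "breast cancer in remission",
    "no cancer found",
    "lung biopsy negative",
    "survived breast cancer",
]

totals = data_parallel_count(sample_texts, num_workers=2)
print(totals["cancer"], totals["lung"])  # prints "4 2"
```

Threads here show only the structure of the technique, not its speedup. On a supercomputer, each segment would go to a separate node, and in deep learning training the merged result would typically be the combined model updates from each node rather than a word count.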
In the meantime, Tourassi’s team has been asked to use a similar approach to analyze millions of cancer pathology reports. While not as far along as the obituary work, this project figures to benefit more from the deep learning training, which has been a recent addition to the research.
“Our results show incremental improvements from deep learning compared with traditional rules-based systems,” said Tourassi. “It is very promising, and we will continue working on it.”
The Challenge of ‘Big-but-Dirty’ Data
Many traditional text mining systems and early deep learning systems rely on experts, who use their knowledge to guide the system’s learning by deciphering clinical text for it. Eventually, deep learning systems will be able to crunch clinical pathology reports and learn without assistance, resulting in an automated and dynamic way of sifting through “big-but-dirty data,” the name Tourassi has given to data for which there’s no way to control the quality.
In both projects, NVIDIA Tesla K20 GPU accelerators are being used to accelerate the deep learning training on Titan. Tourassi reports that the process has been unfolding eight to 10 times faster on GPUs than it did on CPUs for the obituary project. The pathology report project is too fresh to have generated concrete data, but Tourassi sees early indications of similar gains.
“Having seen the clinical performance boost in both applications, I’m a believer” in GPUs, she said. “I now understand the value of scaling these tools for use on the supercomputer.”
And while the goals of both projects are clear, Tourassi hopes to push the efforts further, as any good researcher would, so that cancer research findings can be reported as close to real time as possible.
“We would like to develop the informatics tools and give them to the different registries so they can accelerate information extraction,” she said. “We hope to modernize the cancer surveillance program.”