For centuries, scientists have assembled and maintained extensive information on plants and stored it in what are known as herbaria (vast numbers of cabinets and drawers) at natural history museums and research institutions across the globe.
They’ve used them to discover and confirm the identity of organisms and catalog their characteristics. Over the past two decades, much of this data has been digitized, and this treasure of text, imagery and samples has become easier to share around the world.
Now, complementary projects at the Smithsonian Institution in the U.S. and the Costa Rica Institute of Technology (ITCR) are tapping the combination of big data analytics, computer vision and GPUs to deepen science’s access to, and understanding of, botanical information.
Their use of GPU-accelerated deep learning promises to hasten the work of researchers, who discover and describe about 2,000 species of plants each year, and need to compare them against the nearly 400,000 known species.
Making Plant Identification Picture Perfect
A team at the ITCR published a paper last year detailing its work on a deep learning algorithm that enables image-based identification of organisms recorded on museum herbarium sheets. This work was conducted jointly with experts from CIRAD and Inria, in France.
Both sets of researchers expect their work to fuel a revolution in the field of biodiversity informatics.
“Instead of having to look at millions of images and search through metadata, we’re approaching a time when we’ll be able to do that through machine learning,” said Eric Schuettpelz, a research botanist at the Smithsonian. “The ability to identify something from an image may, in a matter of years, be a rather trivial endeavor.”
And that, in turn, is good news for efforts to preserve natural habitats.
“Plant species identification is particularly important for biodiversity conservation,” said Jose Mario Carranza-Rojas, a Ph.D. candidate on the ITCR team.
From Ecotourism to Informatics
The associate professor overseeing the Costa Rica research, Erick Mata-Montero, was in on the ground floor of biodiversity informatics. After studying at the University of Oregon, Mata-Montero returned to his native country in 1990 to find Costa Rica in the midst of an ecotourism boom and an associated effort to create and consolidate protected wildlife areas to conserve the nation’s biodiversity.
To aid the effort’s scientific understanding, Mata-Montero joined Costa Rica’s National Biodiversity Institute. By 1995, he was heading up the organization’s biodiversity informatics program, which quickly became a pioneer in the field.
Mata-Montero’s work feeds directly into his research with Carranza-Rojas, whose master’s thesis focused on algorithmic approaches to improving the identification of plants based on characteristics of their leaves, such as contours, veins and texture. During a four-month internship at CIRAD in France last year, Carranza-Rojas discovered work by Pl@ntNet, a consortium that’s created a mobile app for enabling image-based plant recognition, and the two groups collaborated on the recently published paper.
Keeping the Foot on the Accelerator
For the lab work supporting the plant-identification research, the Costa Rican team trained a convolutional neural network on about 260,000 images using two NVIDIA GeForce GPUs, the Caffe deep learning framework and cuDNN.
“Without this technology, it would’ve been impossible to run the network with such a big dataset,” said Carranza-Rojas. “On common CPUs, it would take forever to train and our experiments would have never finished.”
Since publishing their paper, the team has continued with new experiments focused on identifying plants in images taken in the wild. For this work, it has upgraded to NVIDIA Tesla GPUs, which have delivered a 25x performance gain over the GeForce GTX 1070 GPU it tested earlier this year, and it has accelerated its work with the Theano computation library for Python.
“We can test many ideas in a fraction of the time of previous experiments, which means we can do more science,” said Carranza-Rojas.
Significantly, the team’s approach hasn’t relied on domain-specific knowledge. As a result, Carranza-Rojas expects to be able to apply the work to identification of a variety of organisms such as insects, birds and fish.
On the plant front, while the work has focused on identification of species, the team would like to move to the genus and family level. It’s currently too computationally demanding to deal with all plant species because of the sheer numbers involved. But the team hopes to take a top-down approach to gathering knowledge at these higher taxonomic levels.
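To see why the higher taxonomic levels are less demanding, consider how relabeling species-level images by genus or family collapses many classes into a few. The Python sketch below is only an illustration and not the ITCR team’s method: the sample labels, the genus_of helper and the hand-written FAMILY_OF table are all assumptions introduced here.

```python
from collections import Counter


def genus_of(species_name):
    """Return the genus, which is the first word of a binomial species name."""
    parts = species_name.split()
    if not parts:
        raise ValueError("empty species name")
    return parts[0]


# Hypothetical species-level labels attached to five herbarium images.
species_labels = [
    "Quercus alba",
    "Quercus rubra",
    "Acer rubrum",
    "Acer saccharum",
    "Quercus alba",
]

# Hand-written genus-to-family lookup for this toy example only.
FAMILY_OF = {"Quercus": "Fagaceae", "Acer": "Sapindaceae"}

# Relabel every image at the coarser levels.
genus_labels = [genus_of(name) for name in species_labels]
family_labels = [FAMILY_OF[genus] for genus in genus_labels]

# Count how many images fall under each genus.
genus_counts = Counter(genus_labels)

print(len(set(species_labels)), "species classes")
print(len(set(genus_labels)), "genus classes")
print(len(set(family_labels)), "family classes")
```

In this toy set, four species classes become two genus classes, so a classifier trained at the genus level has fewer outputs to learn and more images per class.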
Tackling Mercury Staining
At the Smithsonian, Schuettpelz said his team became aware of the Costa Rican effort while working on their own project. Although the two teams didn’t collaborate, he believes the studies in combination may have a bigger impact.
“Coming at a problem from a couple different angles is ultimately a good thing,” he said.
The Smithsonian team has focused on identifying mercury staining, the result of early botanists treating specimens with the toxic substance to protect them from insects. A goal of their research was to know where mercury staining was prevalent in their collection.
“We can scan a million images and easily see where the plants treated with mercury are,” said Schuettpelz. Those samples with mercury staining can be isolated in special folders.
The Smithsonian team started by building a training set of images of stained and unstained specimens. They evaluated about 1,000 neural networks and found one that could identify stained specimens with 90 percent accuracy.
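Picking one network out of many candidates generally comes down to scoring each on held-out labeled images and keeping the best. The Python sketch below shows that selection loop in miniature, under stated assumptions: the candidates are simple thresholds on a made-up stain score rather than neural networks, and the validation data and function names are hypothetical, not taken from the Smithsonian’s work.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def select_best(candidates, evaluate):
    """Return (best_candidate, best_score); the first candidate wins ties."""
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Toy validation set: (hypothetical stain score, true label), 1 = stained.
validation = [(0.9, 1), (0.8, 1), (0.55, 1), (0.45, 0), (0.2, 0), (0.1, 0)]


def evaluate_threshold(threshold):
    """Score a stand-in 'model' that flags a specimen when its score >= threshold."""
    predictions = [1 if score >= threshold else 0 for score, _ in validation]
    labels = [label for _, label in validation]
    return accuracy(predictions, labels)


best_threshold, best_accuracy = select_best([0.3, 0.5, 0.7], evaluate_threshold)
print(best_threshold, best_accuracy)
```

With real networks, evaluate would train or load each candidate and measure it on images set aside from training, but the comparison step stays the same.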
A Step Further
Emboldened by their success, the team decided to see how their network would do at distinguishing between plants that look similar to a trained eye. They built another dataset with 10,000 images of two hard-to-distinguish plant families, and achieved 96 percent accuracy in distinguishing between them.
Like their peers in Costa Rica, the Smithsonian team credits GPUs with making their research possible. Rebecca Dikow, a research data scientist at the Smithsonian, said that training their network, which ran on Wolfram Mathematica with CUDA and cuDNN integrated into the mix, would’ve taken hundreds of times longer on a CPU than it did with the two NVIDIA Tesla GPU accelerators in the Smithsonian computing cluster.
“A lot of this work involves iterating over lots of different parameters, tweaking things and then running them through another network,” said Dikow in describing the computing demands.
Similar to the ITCR’s work with Pl@ntNet, the Smithsonian team is pursuing a collaboration with a larger-scale effort — in this case with iDigBio, a National Science Foundation-funded digital repository for biological data. Dikow suggested that such joint efforts will bring out the best in deep learning projects.
“Everyone who’s undertaking these lines of research has the same feeling,” said Dikow. “We really want to make our networks as robust as possible, and so collaboration is definitely the way to go.”