Pinterest Trains Visual Search Faster with Optimized Architecture on NVIDIA GPUs

Pinterest is scaling up the AI behind more than 240 billion images and features like Shop the Look, which links Pinners to retailers.
by Scott Martin

Pinterest now has more than 440 million reasons to offer the best visual search experience: that’s how many monthly active users its popular image-sharing and social media service attracts.

Visual search enables Pinterest users to search for images using text, screenshots or camera photos. It’s the core AI behind how people build their Boards of Pins — collections of images organized by theme — around their interests and plans. It’s also how people on Pinterest can take action on the inspiration they discover, such as shopping and making purchases based on the products within scenes.

But tracking more than 240 billion images and 5 billion Boards is no small data trick.

This requires visual embeddings — mathematical representations of the objects in a scene. Models generate these embeddings automatically, and comparing the embeddings of two images measures how similar they are — say, a sofa in a TV show’s living room compared to ones for sale at retailers.
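Concretely, once two images are reduced to embedding vectors, their similarity can be scored with a standard measure such as cosine similarity. The sketch below is illustrative only: the tiny made-up vectors and variable names stand in for real image embeddings, which typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (closer to 1.0 = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins for real image embeddings.
sofa_in_show = np.array([0.8, 0.1, 0.3, 0.5])   # sofa seen in a TV show scene
sofa_for_sale = np.array([0.7, 0.2, 0.4, 0.5])  # similar sofa in a retailer's catalog
lamp = np.array([0.1, 0.9, 0.0, 0.2])           # unrelated product

print(cosine_similarity(sofa_in_show, sofa_for_sale))  # high score: likely a match
print(cosine_similarity(sofa_in_show, lamp))           # lower score: not a match
```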

Pinterest is improving its search results by pretraining its visual embeddings on a smaller dataset. The overall goal is one unified visual embedding that performs well across its key business features.

Powered by NVIDIA V100 Tensor Core GPUs, this technique pretrains Pinterest’s neural nets on a subset of about 1.3 billion images to yield improved relevance across the wider set of hundreds of billions of images.

Improving results on the unified visual embedding in this fashion can benefit all applications on Pinterest, said Josh Beal, a machine learning researcher for Visual Search at the company.

“This model is fine-tuned on various multitask datasets. And the goal of this project was to scale the model to a large scale,” he said.
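The article doesn’t include Pinterest’s training code, but the pretrain-then-fine-tune pattern Beal describes looks roughly like the following PyTorch sketch. Everything here is an assumption for illustration: the ResNeXt-101 backbone comes from torchvision as a stand-in for Pinterest’s own pretrained weights, the embedding dimension is invented, and a single head stands in for the multitask setup.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on a large image corpus (a stand-in for
# Pinterest's own large-scale pretraining on ~1.3 billion images).
model = models.resnext101_32x8d(weights="IMAGENET1K_V1")

# Swap the classification head for a projection into an embedding space.
embedding_dim = 256  # assumed; the actual embedding size isn't stated in the article
model.fc = nn.Linear(model.fc.in_features, embedding_dim)

# One common fine-tuning strategy: freeze the backbone and train only the new head first.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```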

Benefitting Shop the Look 

With so many visuals, and new ones coming in all the time, Pinterest is continuously training its neural networks to identify them in relation to others.

A popular visual search feature, Pinterest’s Shop the Look enables people to shop for home and fashion items. By tapping into visual embeddings, Shop the Look can identify items in Pins and connect Pinners to those products online.

Product matches are key to Pinterest’s visually driven commerce. And it isn’t an easy problem to solve at Pinterest scale.

Yet it matters. Another Pinterest visual feature is the ability to search for specific products within an image, or Pin. Improving the accuracy of recommendations with the visual embedding improves the magic factor in matches, boosting people’s experience of discovering relevant products and ideas.

An additional feature, Pinterest’s Lens camera search, aims to recommend visually relevant Pins based on the photos Pinners take with their cameras.

“Unified embedding for visual search benefits all these downstream applications,” said Beal.

Making Visual Search More Powerful

Several Pinterest teams have been working to improve visual search on the hundreds of billions of images within Pins. But given the massive scale of the effort and its cost and engineering resource constraints, Pinterest wanted to optimize its existing architecture.

With some suggested optimizations to its ResNeXt-101 architecture and by simply upgrading to the latest releases of NVIDIA libraries, including cuDNN v8, automatic mixed precision and NCCL, Pinterest was able to improve the training performance of its models by more than 60 percent.
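As a rough illustration of those ingredients (cuDNN autotuning, automatic mixed precision and NCCL-backed multi-GPU training), the PyTorch sketch below shows how they typically fit together. It is not Pinterest’s actual code: the model, hyperparameters and launch setup are placeholders, and it assumes a torchrun-style launcher that sets the distributed environment variables.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import models

torch.backends.cudnn.benchmark = True      # let cuDNN autotune the fastest conv kernels

# NCCL handles GPU-to-GPU communication; assumes RANK, WORLD_SIZE and
# MASTER_ADDR are set in the environment by the launcher.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = models.resnext101_32x8d().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()       # loss scaling keeps FP16 gradients from underflowing
criterion = torch.nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss
```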

NVIDIA’s GPU-accelerated libraries are constantly being updated to enable companies such as Pinterest to get more performance out of their existing hardware investment.

“It has improved the quality of the visual embedding, so that leads to more relevant results in visual search,” said Beal.