Music to the Ears: ‘Cocktail Party’ Problem Gets a Round of AI

MIT researchers are training deep learning neural networks on music videos to separate sounds from each other.
by Scott Martin

The classic cocktail party problem — how to filter out specific sounds from a variety of background noises — is getting a shot of AI.

Human ears do a great job of deciphering sounds from a din because the brain can focus our attention on what we want to hear. Machine-based “sound source separation,” however, has for years befuddled engineers.

MIT researchers are training neural networks using music videos to better pinpoint sound sources.

The team’s deep learning system “learns directly from a lot of unlabeled YouTube videos, and it gets to know which objects make what kinds of sounds,” said Hang Zhao, an MIT researcher and former NVIDIA Research intern.

It’s work that Zhao describes as groundbreaking, and it has wide-ranging applications in speech, audiology, music and robotics.

Learning Through Binge-Watching

MIT unleashed a novel approach on the problem: train deep neural networks on images and audio from YouTube videos. The aim was to learn to pinpoint, down to the pixel, the regions of a video that produce each sound.

Dubbed PixelPlayer, the system was trained on 60 hours of music videos from YouTube. It can identify more than 20 instruments so far.

The MIT team, working at the institute’s Computer Science and Artificial Intelligence Lab, developed three convolutional neural networks that work in concert: one encodes the visual input, another encodes the audio input, and the third synthesizes the separated output from the two.
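The team’s code isn’t reproduced here, but the division of labor might look roughly like the following PyTorch sketch. The class names, layer sizes and fusion step are illustrative assumptions, not the actual PixelPlayer implementation.

```python
# Illustrative sketch of the three-network layout described above.
# All architecture details here are assumptions made for clarity.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Encodes video frames into per-pixel visual features."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1),
        )
    def forward(self, frames):           # frames: (B, 3, H, W)
        return self.conv(frames)         # (B, feat_dim, H/4, W/4)

class AudioEncoder(nn.Module):
    """Encodes the mixture spectrogram into the same feature space."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )
    def forward(self, spec):             # spec: (B, 1, F, T)
        return self.conv(spec)           # (B, feat_dim, F, T)

class Synthesizer(nn.Module):
    """Combines one pixel's visual feature with the audio features
    to predict a separation mask over the spectrogram."""
    def forward(self, pixel_feat, audio_feat):
        # pixel_feat: (B, feat_dim), audio_feat: (B, feat_dim, F, T)
        weights = pixel_feat[:, :, None, None]           # broadcast over F, T
        mask = torch.sigmoid((weights * audio_feat).sum(dim=1, keepdim=True))
        return mask                                       # (B, 1, F, T)

# Example forward pass with dummy data
video_net, audio_net, synth = VideoEncoder(), AudioEncoder(), Synthesizer()
frames = torch.randn(1, 3, 224, 224)        # one RGB video frame
spec = torch.randn(1, 1, 256, 256)          # mixture spectrogram (magnitude)
vis = video_net(frames)                     # per-pixel visual feature map
pixel_feat = vis[:, :, 10, 20]              # feature at one spatial location
mask = synth(pixel_feat, audio_net(spec))   # separation mask for that pixel
```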

The PixelPlayer training dataset consisted of 714 YouTube videos. “The convolutional neural network was able to process the data at a very high speed because it was running on four NVIDIA TITAN V GPUs,” Zhao said. “It learned in about a day.”

PixelPlayer is self-supervised. It doesn’t require any human intervention to annotate what the instruments are or how they sound. Instead, the system has learned how objects such as tubas and trumpets look, how they sound and how they move.
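In rough terms, label-free training like this can be set up by mixing the audio from two videos and using the original, unmixed tracks as free training targets. The sketch below, which reuses the illustrative networks from the earlier sketch, shows one way such a step might look; the ratio-mask targets and L1 loss are assumptions, not the team’s exact recipe.

```python
# Hedged sketch of a self-supervised "mix and separate" style training step.
# The mixing, pooling, target and loss details are illustrative assumptions.
import torch.nn.functional as F

def mix_and_separate_step(video_net, audio_net, synth, clip_a, clip_b, optimizer):
    """clip_a / clip_b: dicts with 'frames' (B, 3, H, W) and 'spec' (B, 1, F, T)."""
    # Build an artificial mixture; the unmixed spectrograms then serve as
    # free supervision, so no human labeling is required.
    mix_spec = clip_a["spec"] + clip_b["spec"]
    audio_feat = audio_net(mix_spec)

    loss = 0.0
    for clip in (clip_a, clip_b):
        vis = video_net(clip["frames"])            # per-pixel visual features
        pixel = vis.mean(dim=(2, 3))               # pool to one feature vector
        mask = synth(pixel, audio_feat)            # predicted mask in [0, 1]
        # Each clip's share of the mixture's energy acts as the target mask
        # (an assumption here, not necessarily the paper's exact target).
        target = (clip["spec"].abs() / (mix_spec.abs() + 1e-8)).clamp(0, 1)
        loss = loss + F.l1_loss(mask, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```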

Turn Up the Tuba

After PixelPlayer locates sound sources in a video, it separates their waveforms. It currently works best at identifying two or three different instruments, but the team aims to scale it to more soon. “We are separating one MP3 file into multiple MP3 files,” Zhao said of the process of pulling out the instruments.
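That final step, turning one mixed track into several, amounts to applying each predicted mask to the mixture’s spectrogram and resynthesizing audio for every source. Here is a rough, self-contained sketch using librosa and soundfile, with random placeholder masks standing in for the networks’ predictions and assumed STFT settings.

```python
# Hedged sketch of the separation step: apply per-source masks to the
# mixture's spectrogram and write one audio file per instrument.
import numpy as np
import librosa
import soundfile as sf

def write_separated_sources(y, sr, masks, hop=256):
    """y: mixture waveform; masks: {name: magnitude mask shaped like the STFT}."""
    stft = librosa.stft(y, n_fft=1022, hop_length=hop)
    for name, mask in masks.items():
        # Keep the mixture's phase; scale magnitudes by the predicted mask.
        separated = librosa.istft(mask * stft, hop_length=hop)
        sf.write(f"{name}.wav", separated, sr)  # WAV for simplicity here;
                                                # the article describes MP3 output

# Toy example: a two-second noise "mixture" split by two complementary masks.
sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)
n_frames = 1 + len(y) // 256                  # librosa's frame count (center=True)
mask_a = np.random.rand(512, n_frames)        # 512 = 1022 // 2 + 1 frequency bins
write_separated_sources(y, sr, {"violin": mask_a, "tuba": 1.0 - mask_a})
```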

PixelPlayer has music applications. Audio engineers could use the AI to boost the level of a faint instrument or mute background noise, helping them improve live recordings or remaster music, Zhao said.

Researchers have been developing deep learning for the cocktail party problem in the quest to improve hearing aids as well. (Read “Hear, Hear: How Deep Learning Is Reinventing Hearing Aids.”)

Its use could extend beyond music and audiology into identifying sounds in the world around us, such as listening for rare bird calls among the sounds of a forest. “The system could be used by robots to understand environmental sounds,” Zhao said.

The MIT researchers plan to present their work at the European Conference on Computer Vision in Munich in September.