How Deep Learning Can Paint Videos in the Style of Art’s Great Masters

by Isha Salian
Imagine your cat walking through this scene.

Thanks to Instagram and Snapchat, adding filters to images and videos is pretty straightforward. But what if you could repaint your smartphone videos in the style of van Gogh’s “Starry Night” or Munch’s “The Scream”?

A team of researchers from Germany’s University of Freiburg has made significant strides toward this goal using an approach to artificial intelligence called deep learning.

The team developed a method that uses a deep neural network to extract a specific artistic style from a source painting, and then synthesizes this information with the content of a separate video. NVIDIA GPUs make it possible to crunch through this computationally intensive work with striking results.

An Algorithm with Long-Term Memory

Prior work has successfully used deep learning to transfer artistic styles from one image to another. Earlier research found that when a deep neural network processes an image, its neural activations encode the image’s style information: brushstrokes, color and other abstract details. The network can then be used to apply this style to what it understands as the content of a second image.
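
In the image-to-image work this builds on (for example Gatys et al.), style is commonly summarized as the correlations between a layer's feature channels, a Gram matrix, while content is compared activation by activation. Below is a minimal NumPy sketch, assuming the feature maps have already been extracted from one layer of a pretrained network. Random arrays stand in for them here, and the function names and normalization are illustrative choices, not the Freiburg team's code:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of one layer's activations.

    features has shape (channels, height, width); the result is (channels, channels).
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_loss(gen_features: np.ndarray, style_features: np.ndarray) -> float:
    """Mean squared difference between the Gram matrices of two feature maps."""
    diff = gram_matrix(gen_features) - gram_matrix(style_features)
    return float(np.mean(diff ** 2))

def content_loss(gen_features: np.ndarray, content_features: np.ndarray) -> float:
    """Mean squared difference between raw activations, so spatial layout matters."""
    return float(np.mean((gen_features - content_features) ** 2))

# Stand-in activations: 8 channels on a 16x16 grid.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))

# Shuffle spatial positions: same channel statistics, different layout.
perm = rng.permutation(16 * 16)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 16, 16)

style_gap = style_loss(shuffled, feats)
content_gap = content_loss(shuffled, feats)
print(f"style loss: {style_gap:.2e}, content loss: {content_gap:.2f}")
```

Because the Gram matrix sums over all positions, shuffling a feature map's pixels leaves its style loss essentially zero while its content loss grows. That is why this representation captures texture, color and brushstroke statistics rather than layout.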

But videos have lots of moving parts. It’s not as simple as taking the technique of style transfer for still images and applying it to each frame of a video.

“If you just apply the algorithm frame by frame, you don’t get a coherent video — you get flickering in the sequence,” says University of Freiburg postdoc Alexey Dosovitskiy. “What we do is introduce additional constraints, which make the video consistent.”

Dosovitskiy and his fellow researchers enforce this consistency by controlling the variation between one frame and the next, which needs to account for three major challenges:

  • A character onscreen should look the same as it moves across a scene,
  • Static components, such as backdrop, should remain visually consistent from frame to frame, and
  • After a character passes across the field of view, the background should look the way it did before the character moved.
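
The frame-to-frame constraint can be written as a penalty on how far each stylized frame drifts from the previous stylized frame after that frame has been shifted along the scene's motion. Below is a minimal sketch, assuming the motion is given as a per-pixel optical-flow field and that a mask marking newly revealed pixels is already known. `warp_backward`, `temporal_loss` and the toy camera pan are illustrative and hypothetical, not the team's implementation:

```python
import numpy as np

def warp_backward(prev_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp prev_frame into the current frame's coordinates.

    flow[y, x] = (dy, dx) says where pixel (y, x) of the current frame came from
    in the previous frame. Nearest-neighbour sampling keeps the sketch short.
    """
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_frame[src_y, src_x]

def temporal_loss(frame: np.ndarray, warped_prev: np.ndarray, weights: np.ndarray) -> float:
    """Weighted mean squared deviation from the warped previous stylized frame.

    weights is 1 where the pixel was already visible in the previous frame and 0
    where it was just uncovered, so newly revealed regions are free to change.
    """
    diff = ((frame - warped_prev) ** 2).sum(axis=-1)  # sum over colour channels
    return float(np.sum(weights * diff) / weights.size)

# Toy scene: the camera pans so each pixel came from 2 columns to its left.
rng = np.random.default_rng(1)
h, w = 32, 32
prev_stylized = rng.random((h, w, 3))
flow = np.zeros((h, w, 2))
flow[..., 1] = -2.0
weights = np.ones((h, w))
weights[:, :2] = 0.0  # leftmost columns are newly revealed, so unconstrained

consistent = np.roll(prev_stylized, 2, axis=1)  # follows the motion exactly
flickering = consistent + 0.1 * rng.standard_normal(consistent.shape)

warped = warp_backward(prev_stylized, flow)
loss_consistent = temporal_loss(consistent, warped, weights)
loss_flicker = temporal_loss(flickering, warped, weights)
print(loss_consistent, loss_flicker)
```

Zero weights on newly revealed pixels keep the penalty from forcing them to copy stale content, while every other pixel is pushed toward its stylized appearance in the previous frame, which is what suppresses flicker.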

The team’s algorithm incorporates constraints that address these issues, penalizing successive frames that look too different from one another. It also uses long-term constraints to aid continuity: when an area of a scene reappears, its stylized appearance from several frames earlier is replicated.
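
One way to express the long-term constraint is to tie each pixel to the most recent earlier frame in which it was visible. Below is a simplified sketch of that weighting, assuming per-pixel visibility masks for a few earlier frames are already available (in practice they would come from motion estimates between frames). The function name, frame offsets and mask values are hypothetical:

```python
import numpy as np

def long_term_weights(visibility):
    """Turn per-frame visibility masks into long-term penalty weights.

    visibility[0] belongs to the most recent earlier frame, visibility[1] to the
    one before it, and so on; 1 means the pixel is visible both there and in the
    current frame. A pixel keeps weight only for the most recent frame in which
    it was visible, so each region is tied to its latest unoccluded appearance.
    """
    weights = []
    covered = np.zeros_like(visibility[0], dtype=float)
    for vis in visibility:
        weights.append(np.maximum(vis - covered, 0.0))
        covered = covered + vis
    return weights

# A 1x4 strip of pixels. A character covered pixels 0-1 in the previous frame
# and pixel 3 five frames back; ten frames back everything was visible.
masks = [
    np.array([[0.0, 0.0, 1.0, 1.0]]),  # 1 frame back
    np.array([[1.0, 1.0, 1.0, 0.0]]),  # 5 frames back
    np.array([[1.0, 1.0, 1.0, 1.0]]),  # 10 frames back
]
weights = long_term_weights(masks)
for offset, w in zip((1, 5, 10), weights):
    print(offset, w)
```

Here pixels 0 and 1, hidden by the character a moment ago, are constrained to match how they looked five frames back. Pixels 2 and 3 follow the previous frame, and the frame ten steps back adds nothing because every pixel already has a more recent reference. Each weight map would then multiply a squared-difference penalty against the corresponding warped earlier frame.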

Smartly constraining a deep learning algorithm produced better consistency in stylizing an animated video.

To make this complex process a reality, the researchers use NVIDIA GPUs. Powered by a GeForce GTX TITAN X GPU, artistic style transfer takes eight to 10 minutes per frame of high-resolution video. That’s 20x faster than on a multi-core CPU.
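
The quoted figures make it easy to estimate total render time. Here is a back-of-the-envelope sketch, assuming a hypothetical 10-second clip at 30 frames per second and taking the CPU time per frame as 20 times the GPU figure:

```python
def render_hours(frames, minutes_per_frame):
    """Total render time in hours for a clip, given a per-frame cost in minutes."""
    return frames * minutes_per_frame / 60

FRAMES = 10 * 30        # hypothetical clip: 10 seconds at 30 fps
GPU_MINUTES = (8, 10)   # per-frame range quoted for the GeForce GTX TITAN X
SPEEDUP = 20            # GPU vs. multi-core CPU, as quoted

gpu_hours = tuple(render_hours(FRAMES, m) for m in GPU_MINUTES)
cpu_hours = tuple(render_hours(FRAMES, m * SPEEDUP) for m in GPU_MINUTES)
print(f"GPU: {gpu_hours[0]:.0f}-{gpu_hours[1]:.0f} h, CPU: {cpu_hours[0]:.0f}-{cpu_hours[1]:.0f} h")
```

Even on the GPU, that short clip takes roughly 40 to 50 hours; the same work on a CPU would run to weeks.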

“GPUs are crucial because this process is quite time consuming,” Dosovitskiy says.

The team also uses our cuDNN deep learning software, whose smaller memory requirements allow them to perform style transfer on high-resolution videos. Multi-GPU systems could speed up the process further, but even so, real-time artistic style transfer for videos is still some way off.

So far, the team has tried its algorithm on both live-action and animated videos. Both render equally well, but Dosovitskiy thinks viewers hold live-action video to higher standards.

“It turns out people are sensitive to this flickering, so even if it’s quite small you can still see it very well when you watch a video,” he says.

Read more about the team’s work in their paper.
