Let There Be Sight: How Deep Learning Is Helping the Blind ‘See’

by Tony Kontzer

Guide dogs are great for helping people who are blind or visually impaired navigate the world. But try getting a dog to read aloud a sign or tell you how much money is in your wallet.

Seeing AI, an app developed by Microsoft AI & Research, has the answers. It essentially narrates the world for blind and low-vision users, allowing them to use their smartphones to identify everything from an object or a color to a dollar bill or a document.

Since the app’s launch last year, it’s been downloaded 150,000 times and used in 5 million tasks, some of which were completed on behalf of one of the world’s most famous blind people.

“Stevie Wonder uses it every day, which is pretty cool,” said Anirudh Koul, a senior data scientist with Microsoft, during a presentation at the GPU Technology Conference in San Jose last month.

A live demo of the app showed just how powerful it can be. Koul had a colleague join him on stage, and when he launched the app on his smartphone and pointed it toward his co-worker, it declared that it was looking at “a 31-year-old man with black hair, wearing glasses, looking happy.”

The result could’ve been even better had the colleague been in Koul’s list of contacts, as Seeing AI integrates with a user’s contacts to identify friends by name.

Koul also shared a couple of compelling use cases, including a blind teacher who leaves the app running and facing the door of her classroom so the kids can’t take advantage of her lack of sight by sneaking in or out. Another user navigated through a hurricane-ravaged area, using the app to avoid downed power lines and other obstacles.

Seeing AI started with a February 2014 effort to create a convolutional neural network that could help find and identify surrounding objects. But the latency was 10 seconds — too slow to help someone trying to make quick decisions.

The following year, Microsoft sponsored a one-week hackathon that attracted 13,000 participants and led to a second attempt that involved mounting a cellphone on the user’s head.

After experimenting with smart glasses, Koul’s team set to work on the application itself. Local training of the network was done on an NVIDIA TITAN X GPU, while the heavier lifting was handed off to an Azure cloud instance running NVIDIA Tesla P100 GPUs. A frame-by-frame analysis determined where each piece of training would occur.

Tuning the AI

Training reflected the uncertain nature of the images the app would have to discern. For example, to train the app to detect money, the team had to expose the network to dirty and out-of-context photos of currency, as well as zoomed-in photos that show too small a portion of a bill to know for sure what it is.

The team asked for volunteers to contribute photos, and got all sorts of results, including one in which a cat was playing with a bill, and another in which the bill was obscured by ice.

As the model was exposed to all these variations, it eventually was able to calculate the minimum parameters needed to identify any photo.

Koul’s team tweaked the network to lean toward negative classification, or not classifying an object at all, rather than guessing and potentially identifying a $5 bill as a $10 bill, which would clearly present a problem for a blind user.
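The idea of declining to answer rather than guessing can be illustrated with a simple confidence threshold. What follows is a minimal sketch, not the Seeing AI implementation: the label set, the `classify_or_abstain` function, and the 0.9 threshold are all hypothetical, chosen only to show how a classifier might stay silent when its top score is not convincing.

```python
import math
from typing import Optional, Sequence

# Hypothetical label set, for illustration only.
LABELS = ["$1", "$5", "$10", "$20"]


def softmax(logits: Sequence[float]) -> list:
    """Convert raw network scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_or_abstain(
    logits: Sequence[float],
    labels: Sequence[str] = LABELS,
    threshold: float = 0.9,  # assumed value; a real app would tune this
) -> Optional[str]:
    """Return the top label only if its probability clears the threshold.

    Returning None means "not sure", which is safer for the user than
    announcing the wrong denomination.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None
    return labels[best]


# A close call between $5 and $10: the sketch abstains.
uncertain = classify_or_abstain([0.1, 4.0, 3.8, 0.2])

# A clear winner: the sketch names the bill.
confident = classify_or_abstain([0.0, 9.0, 1.0, 0.0])
```

Raising the threshold makes the classifier abstain more often, trading coverage for fewer wrong announcements.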

Ultimately, one of AI’s hallmarks will prove to be a boon to blind and vision-impaired users, as the app will only get better and more accurate the more it’s used.