by Calisa Cole

Recently we had a chance to interview Ben Jiang, CEO of speech indexing startup Nexiwave. Ben aims to help people retrieve spoken words as easily as we google text and images. An MIT graduate who initiated a high-performance computing cluster there, Ben co-founded Cambridge, Mass.-based Nexiwave in 2008 with Nickolay Shmyrev. An excerpt of the interview follows.

Nexiwave CEO Ben Jiang

NVIDIA: Ben, what makes speech indexing compelling?

Ben: Ninety percent of human communication is through speech. The number of spoken words that could potentially be indexed and searched is staggering. Skype callers have logged over 100 billion minutes of talk time. Conference call companies are carrying over a billion minutes of calls per month. There are hundreds of millions of podcasts on the web, and 24 hours of video are uploaded to YouTube every minute.

The problem is that today's information retrieval applications, such as internet search, focus on textual content. Information retrieval from speech content still relies primarily on human memory. The objective of speech indexing is to let us easily extract information from archived audio and video content. Through the Nexiwave system, an end user can search the content and find the exact location of interest, whether it's a word, a phrase or a general topic.

NVIDIA: What are some of the potentially big applications of speech indexing?

Ben: Think about the conference calls that happen 24×7 at companies around the world. We've all had moments where we thought: "Ahh, John said something really useful in the last call. I wish I could remember exactly what he said." In the future, with speech indexing-enabled conference calls, a quick search will locate the exact audio snippet. Another interesting market is call centers, where the ability to do a deep search (not just by time of call and phone number) will enable companies to find out what customers are really telling them. Other markets are e-discovery (in the legal field), recorded educational media, podcasts and audio-centric enterprises.

NVIDIA: What stage is your technology in?

Ben: Nexiwave 1.0 was released in October 2009. Nexiwave 2.0, our NVIDIA GPU-enabled version, was released on June 3, 2010 and is in production. We offer a SaaS (software as a service) and cloud computing solution as well as software licenses.

NVIDIA: What is the connection between Nexiwave and CMU Sphinx, the speech recognition system from Carnegie Mellon?

Ben: CMU Sphinx is a very popular open source speech processing engine. Our system is built on top of it with many of our own proprietary improvements, such as CUDA-based acoustic scoring (a total re-write of the core acoustic scoring code). We are one of the major commercial companies contributing to it through code fixes, developer resources and user forum support.

NVIDIA: Where does the GPU fit into this?

Ben: Speech indexing is computationally intensive and has traditionally been very expensive. It can be processed efficiently in parallel, which means the GPU is a perfect fit for it. The GPU will solve the cost issue associated with indexing vast amounts of audio content quickly and accurately.
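
To illustrate why this kind of workload parallelizes well, here is a minimal Python sketch of acoustic scoring with diagonal-covariance Gaussians, a common formulation in speech recognition. It is not Nexiwave's or CMU Sphinx's code; the function names (`log_gaussian`, `score_frames`) and the toy data are hypothetical and chosen only for illustration. The point is that every (frame, model) score is computed independently of all the others, so on a GPU each one could plausibly be assigned to its own thread.

```python
import math

def log_gaussian(frame, mean, var):
    """Log-density of one feature frame under a diagonal-covariance Gaussian."""
    total = 0.0
    for x, m, v in zip(frame, mean, var):
        total += math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
    return -0.5 * total

def score_frames(frames, models):
    """Score every (frame, model) pair.

    No pair depends on any other, so the nested loop below stands in for
    a grid of independent GPU threads (an assumption for illustration).
    """
    return [[log_gaussian(f, mean, var) for (mean, var) in models]
            for f in frames]

# Hypothetical toy data: two 2-dimensional frames and two Gaussian models,
# each model given as (mean vector, variance vector).
frames = [[0.0, 0.0], [1.0, -1.0]]
models = [([0.0, 0.0], [1.0, 1.0]), ([1.0, -1.0], [0.5, 0.5])]

scores = score_frames(frames, models)

# Index of the best-scoring model for each frame.
best = [max(range(len(models)), key=lambda j: row[j]) for row in scores]
```

In this toy setup each frame sits exactly on the mean of one model, so that model scores highest for it. A real system scores thousands of Gaussians against every frame of audio, which is where the independent per-pair structure pays off.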

NVIDIA: How did you like programming/porting in the CUDA C environment?

Ben: Our experience with programming in CUDA C has been enjoyable. The CUDA Best Practices Guide provided tons of help in performance tuning.

NVIDIA: How does CUDA help you?

Ben: Nexiwave has been able to move 75% of our computing processes (or 11 million computation loops per audio minute) to CUDA C. This directly translates into cost reduction (we have released a large number of CPU machines back to our computing provider). The exciting thing about this speedup is that it enables us to move into markets where speech indexing has not been possible before.