Google turns audio into text with GAudi

Posted by Christian Zibreg

Mountain View (CA) - Google quietly released its GAudi service, a new take on online information retrieval. It uses a remarkably accurate speech recognition engine to extract information from audio and video content and turn it into indexed text that can easily be searched. GAudi is currently limited to politicians' speeches on YouTube, but Google said it will gradually increase its scope.

Google Audio Indexing (GAudi) is currently available as an experimental Google Labs project with very limited content. For now it covers only political content; Google said that "political videos and election materials are a special case of broadcast news content, a domain that has received a lot of academic and industry attention and is known to perform well." According to the company, the service may help people to better follow the views, actions and platforms of the two presidential candidates.

"The US election is just a first step," Google said. "We see it as an experiment platform where we can learn what features make the best user experience for people looking for spoken content on the Web." So how well does it work? In short: remarkably well, although it is apparent that GAudi is limited by the weaknesses of today's speech recognition algorithms.

The user interface is just what you would expect from Google: Clean, crisp and focused on search. It enables users to quickly find a video based on simple search queries or even entire sentences. The magic that works behind the curtain is speech recognition technology developed in-house by Google's dedicated speech research group. Machine-assisted speech recognition has been around for some time, but GAudi clearly showcases how much it has improved. GAudi scans audio information from YouTube videos and recognizes spoken words that are turned into text to supplement the existing indexing information, which greatly enhances its search capabilities.

Searches return a list of YouTube videos that contain your query terms. When you select a video from the list, GAudi shows it in an embedded YouTube player that is modified with up to ten yellow marks on a timeline to indicate where the query is mentioned within the video. You can navigate to a yellow marker to read the transcript or click on a mark to jump directly to the portion of the video where the query is mentioned. There is an additional search box to initiate a new search within the current video.
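To make the idea concrete, here is a minimal sketch of how timestamped transcripts could be indexed and searched to produce such timeline marks. It is not Google's implementation, which has not been published in detail; the function names, the data layout (each recognized word paired with its start time in seconds) and the `MAX_MARKERS` constant are assumptions made purely for illustration.

```python
from collections import defaultdict

# Assumption: cap markers at ten, mirroring the "up to ten yellow marks" above.
MAX_MARKERS = 10


def build_index(transcripts):
    """Map each lowercased word to {video_id: [start_seconds, ...]}.

    `transcripts` is a hypothetical structure: {video_id: [(word, start), ...]},
    standing in for the output of a speech recognizer.
    """
    index = defaultdict(lambda: defaultdict(list))
    for video_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()][video_id].append(start)
    return index


def search(index, query):
    """Return {video_id: marker_times} for videos containing every query word."""
    terms = query.lower().split()
    if not terms:
        return {}
    # Keep only videos in which all query words were recognized.
    videos = set(index.get(terms[0], {}))
    for term in terms[1:]:
        videos &= set(index.get(term, {}))
    results = {}
    for video_id in videos:
        # Merge the mention times of all terms and keep the earliest few.
        times = sorted({t for term in terms for t in index[term][video_id]})
        results[video_id] = times[:MAX_MARKERS]
    return results


if __name__ == "__main__":
    demo = {
        "speech-1": [("free", 12.0), ("trade", 12.4), ("free", 95.5)],
        "speech-2": [("trade", 40.2)],
    }
    print(search(build_index(demo), "free trade"))
```

A search of this kind can only find words the recognizer transcribed correctly, so any misrecognized word is invisible to the index.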

Overall, GAudi's accuracy is impressive, but its engine is not bulletproof. For example, in this video, GAudi incorrectly detected the country name "Czechoslovakia" as "tech also but there." Common words are also prone to mistakes: GAudi mistakenly replaced "free" with "forty" in the same video. So expect the current weaknesses of speech recognition to show up in GAudi as well. The quality of the audio, background noise, and the speaker's precision, pacing and emphasis also affect the outcome.

We have no doubt in our minds that GAudi will become an important part of the Google search engine. The technology has the potential to significantly enhance the usefulness of the search engine and enable searching rich content in new ways.