Researchers get audio information out of visual data

  • Boffins at MIT, Microsoft and Adobe have developed an algorithm that can analyze almost imperceptible vibrations in video and reconstruct the audio that caused them. One example of this astonishing achievement: the researchers were able to figure out what people were saying in a conversation by observing, through soundproof glass, the vibrations of a bag of chips.

    That last sentence deserves to be savored.

    Similarly, other experiments extracted usable audio from video of aluminum foil, the surface of a glass of water and, again astonishingly enough, the leaves of a house plant.

    The research will be presented at this year's Siggraph, the computer graphics conference.

    “When sound hits an object, it causes the object to vibrate,” says Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author on the new paper. “The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
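
    The idea Davis describes can be reduced to a toy sketch: each video frame samples the scene at one instant, so collapsing a patch of pixels to a single number per frame turns a tiny vibration into an ordinary 1-D time signal sampled at the frame rate. To be clear, this is not the paper's method — the researchers use far more sophisticated motion analysis — and the synthetic 32×32 "video", the 2,000 fps rate and the 110 Hz tone below are all invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
fps = 2000
t = np.arange(fps) / fps                          # one second of "video"
vibration = 0.01 * np.sin(2 * np.pi * 110.0 * t)  # tiny 110 Hz modulation

# Synthetic frames: a static 32x32 scene whose brightness wobbles imperceptibly.
scene = rng.uniform(0.3, 0.7, size=(32, 32))
frames = scene[None, :, :] * (1.0 + vibration[:, None, None])

# Collapse each frame to one number -> a 1-D signal sampled at the frame rate.
signal = frames.mean(axis=(1, 2))
signal -= signal.mean()                           # drop the static (DC) part

# The vibration frequency pops right out of the spectrum.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
recovered = freqs[np.argmax(spectrum)]
print(recovered)
```

    A 1% brightness wobble would be invisible to the naked eye, yet it survives the averaging and dominates the spectrum — which is the "information was there all along" point Davis is making.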

    Davis' co-authors of the Siggraph paper are Frédo Durand and Bill Freeman, both MIT professors of computer science and engineering; Neal Wadhwa, a graduate student in Freeman’s group; Michael Rubinstein of Microsoft Research, who did his PhD with Freeman; and Gautham Mysore of Adobe Research.

    Reconstructing audio from video requires that the video's sampling rate, its frame rate, be higher than the frequency of the audio signal (per the Nyquist criterion, at least twice its highest frequency, or the signal aliases). So the researchers used very high-speed cameras, recording at anywhere from 2,000 to 6,000 frames a second (fps) as opposed to the standard 60 fps. There are commercial cameras that can go up to 100,000 fps, but the researchers wanted to use commodity hardware.
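
    That sampling constraint is the classic Nyquist limit, and it is easy to see numerically: a tone sampled well above twice its frequency is recovered faithfully, while the same tone sampled at 60 fps folds down to a bogus low frequency. A small sketch (the 440 Hz tone and the frame rates are illustrative choices, not figures from the paper):

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Strongest nonzero frequency component of a real signal, in Hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[1:][np.argmax(spectrum[1:])]  # skip the DC bin

def filmed_tone(freq_hz, frame_rate, duration=1.0):
    """A pure tone 'filmed' at frame_rate: one vibration sample per frame."""
    t = np.arange(0.0, duration, 1.0 / frame_rate)
    return np.sin(2.0 * np.pi * freq_hz * t)

TONE = 440.0  # Hz -- the A above middle C

# A 2,000 fps camera samples far above the 880 Hz Nyquist rate: tone recovered.
hi_speed = dominant_frequency(filmed_tone(TONE, 2000), 2000)

# A 60 fps camera can only represent up to 30 Hz: 440 Hz aliases down to 20 Hz.
standard = dominant_frequency(filmed_tone(TONE, 60), 60)

print(hi_speed, standard)
```

    The 60 fps result is not noise — it is a real, repeatable signal at the wrong pitch, which is consistent with the next point: low-frame-rate footage still carries coarse acoustic information even though it cannot reproduce speech faithfully.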

    Even at 60 fps, though, the researchers were able to extract a reconstructed audio track which, while not as good as the one from high-speed video, was good enough to tell whether the speaker was a man or a woman, how many people were speaking in a room, and even enough acoustic properties to help identify the speaker.

    There are obvious applications in law enforcement and forensics, but the researchers prefer to focus on the value of their technique as a new kind of imaging, one that leverages the acoustic information extracted from an object to determine its material and structural properties.

    Take a look for yourself: