Carnegie Mellon takes new approach to speed up Internet book scanning initiative

  • Pittsburgh (PA) – Students at Carnegie Mellon University believe they have found a solution to one of the major bottlenecks in an initiative that aims to transform books, newspapers and other printed materials into digitized, computer-searchable text. Words that scanning software fails to recognize are used as passphrases on websites, enlisting the help of Internet users to accelerate the initiative.

    The Carnegie Mellon project takes a new approach to an existing technology that many Internet users are already familiar with. CAPTCHA passphrases, which many websites use in forms to distinguish human users from computer-controlled scripts, normally consist of distorted letters. The students believe that words that were not recognized during the scanning process can serve the same purpose.


    The approach is called ReCAPTCHA (Re- Completely Automated Public Turing Test to Tell Computers and Humans Apart). The words, which are often underlined, printed in poor quality or surrounded by scribbles, are believed to deliver the same level of security because optical character recognition (OCR) software does not recognize them. Humans, however, can recognize them easily, and the words can then be inserted back into the scanned book to fill the gaps in the scanned document, Carnegie Mellon claims.
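
The idea of feeding human answers back into a scanned text can be illustrated with a small Python sketch. It is only an illustration, not the Carnegie Mellon implementation: the article does not say how answers are validated, so the agreement threshold (`min_votes`), the lowercase normalization and the function names `resolve_word` and `fill_in` are all assumptions made for this example.

```python
from collections import Counter


def resolve_word(answers, min_votes=3):
    """Return the transcription most users agree on, or None.

    Assumption: a word counts as solved once `min_votes` users typed
    the same thing. The article does not describe the real rule.
    """
    # Simplification: ignore case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    return word if votes >= min_votes else None


def fill_in(ocr_tokens, resolved):
    """Replace tokens the OCR pass could not read (None) with
    human-supplied words; unresolved gaps are marked "[?]"."""
    return [
        resolved.get(i, "[?]") if token is None else token
        for i, token in enumerate(ocr_tokens)
    ]


# Hypothetical data: OCR failed on the second and fifth words.
tokens = ["the", None, "brown", "fox", None]
answers = {
    1: ["quick", "Quick", "quick ", "qu1ck"],  # three users agree
    4: ["jumps", "jumped"],                     # no agreement yet
}

resolved = {}
for index, typed in answers.items():
    word = resolve_word(typed)
    if word is not None:
        resolved[index] = word

print(fill_in(tokens, resolved))
```

Requiring several matching answers is one plausible way to guard against typos; a word without enough agreement simply stays marked as unread.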

    Luis von Ahn, an assistant professor of computer science, estimates that Internet users are “solving” more than 60 million CAPTCHAs each day, with each test taking about 10 seconds. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with ReCAPTCHAs," he said.
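
The "more than 150,000 hours" figure follows from the two estimates quoted above. A quick check in Python, using only those numbers (60 million tests per day at about 10 seconds each):

```python
# Figures quoted in the article
captchas_per_day = 60_000_000
seconds_per_captcha = 10

total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600  # 3,600 seconds per hour

print(f"{total_hours:,.0f} hours per day")  # prints "166,667 hours per day"
```

At roughly 166,667 hours, the estimate is consistent with the quoted "more than 150,000".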