Google acquires reCAPTCHA for book scans

By teaching computers to read, Google hopes to bolster Google Books and the Google News Archive

Technology trends and news by Ronny Kerr
September 16, 2009 | Comments

Google has acquired reCAPTCHA, the company known by most users as a provider of those (slightly annoying) tests where you have to type out squiggly, distorted words before you can sign in to a site. The idea is to prevent bots from buying all the tickets for a show in the first 10 seconds of the sale or signing up for every available email address.

Google says reCAPTCHA currently guards over 100,000 Web sites from such spam attacks.

The service has much broader applications, though.

reCAPTCHA is aiding the massive task of digitizing books, newspapers and old-time radio shows. For physical books, it’s a two-step process: scan a page, then transform the image into text using "Optical Character Recognition" (OCR).

Unfortunately, even the most sophisticated OCR program cannot reliably transcribe every scanned page of text. In some older books, time has taken its toll on the paper and ink, or the font is just plain weird. A human, however, can usually still figure out what the word is.

According to reCAPTCHA’s Web site:

About 200 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that's not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day.

reCAPTCHA gives users two words. The first is a word reCAPTCHA knows. The second is the word from that ancient or damaged text that the computer is trying to transcribe. If a user gets the first word right, then reCAPTCHA assumes it’s dealing with a human, and accepts the user’s input for the second word. After many run-throughs with many different users, reCAPTCHA pools all the inputs for the second word and assumes the majority answer is probably what the word actually is.
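
The two-word check and the majority vote described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the function names, the case-insensitive comparison, the `min_votes` threshold and the strict-majority rule are all hypothetical choices made for the example.

```python
from collections import Counter


def record_submission(votes, known_word, typed_known, typed_unknown):
    """Count the guess for the unknown word only if the known word was typed correctly.

    Hypothetical helper: a correct known word is treated as evidence of a human.
    """
    if typed_known.strip().lower() != known_word.lower():
        return False  # failed the control word, so the guess is discarded
    votes[typed_unknown.strip().lower()] += 1
    return True


def majority_answer(votes, min_votes=3):
    """Return the guess backed by more than half of the accepted votes, else None.

    min_votes is an assumed threshold, not a documented reCAPTCHA setting.
    """
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count * 2 > total else None


# Example: the known word is "morning"; the scanned word is unclear.
votes = Counter()
submissions = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "apon"),  # wrong control word: ignored
    ("morning", "up0n"),
    ("morning", "upon"),
]
for typed_known, typed_unknown in submissions:
    record_submission(votes, "morning", typed_known, typed_unknown)

print(majority_answer(votes))
```

Here the submission that misses the known word never reaches the tally, so one careless or automated answer cannot skew the transcription of the unknown word.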

In this way, reCAPTCHA can continually utilize the crowd to correct and improve its OCR.

Google’s acquisition of the company makes a lot of sense, considering that Google is currently invested in two large-scale digitization projects: Google Books and the Google News Archive.

