Friday, December 6, 2013

Enable computers to read historical handwritten documents


A little over a year ago, Mocavo acquired ReadyMicro and the incredible mind known as Matt Garner. One of Matt’s lifelong passions and curiosities is to enable computers to read historical handwritten documents to bring genealogy search to the next level. It’s well known in the genealogy industry that historical handwriting recognition is the Holy Grail – the single largest technological advancement that would enable more content to become accessible online (except for maybe the invention of the Web). For the past year, we’ve joined with Matt to tackle this very hard problem, and have finally made enough progress that we can begin to report on it.
 
Let me start by explaining the problem. Ask a computer to read the page below and it will stumble all over place.
 
 
OCR (optical character recognition) technology could read some of the typewritten text – but would be confused by the handwriting (and invent typewritten letters that it thinks it sees inside handwritten text). To make matters worse, this page has multiple typewritten font types, including one that looks like cursive handwriting.
 
The first process we had to develop was a way to perfectly separate handwriting from typewritten text. If we could do this, the OCR could read the typewritten text, and Matt’s code could attempt to read the handwritten text. We call this process Handwriting Detection, and we figured that if the system couldn’t detect the presence of handwriting, how on Earth would we hope to decipher the marks into words? In the example below, you can see how our system marks typewritten text in green and handwritten text in red – with blue to denote what it believes are graphics or images. It’s not 100% perfect, but hopefully you agree that it’s headed in the right direction.
 
 
Now that we’ve detected where the handwriting is, we can start having some fun. Let’s go back 130 years and change the ink from black to blue.
 
 
Now, this is just handwriting detection (where we don’t understand what’s written – we just know that handwriting is there).

Let’s talk recognition.

Historical handwriting recognition is one of the toughest technical challenges to solve. First, penmanship is entirely unique to the individual. Second, because it’s historical handwriting, it’s in cursive. All the letters run together, adding another layer of complexity. Third, the way we wrote cursive in the 1700′s is different than the cursive we write now. There are even variations between decades. Our mind has an incredible capability of seeing through incomplete sets of data (a missing character stroke, poor handwriting, an A that sort of looks like an O, etc). Our brains do all of this for us and we don’t even notice it. When you think about how to describe this to a computer, you begin to lose your mind! I believe some of the greatest problems mankind can solve are those that someone would never have started if they had known how hard the challenge was ahead of time. Matt fooled himself just enough to start on the problem and now he’s making real progress from which we are all going to benefit.

Here’s the exciting part: Our recognition technology is starting to work. With limited vocabularies (potential answers), we’re achieving 90-95% accuracy. Sometimes, the technology is able to read things we’re convinced are unreadable (but after getting the answer back from the computer, you realize what was actually written). We grow closer to the Holy Grail every day and can’t wait until we can use the technology to bring more content online, free forever.

Matt and I will keep you updated on our progress over the coming weeks and months, which should hopefully make for some exciting news in genealogy.
 
Best Regards,
 
Cliff Shaw
Founder

No comments:

Post a Comment