Helping you research your family history around the world.
Friday, December 6, 2013
Enable computers to read historical handwritten documents
little over a year ago, Mocavo acquired ReadyMicro and the incredible
mind known as Matt Garner. One of Matt’s lifelong passions and
curiosities is to enable computers to read historical handwritten
documents to bring genealogy search
to the next level. It’s well known in the genealogy industry that
historical handwriting recognition is the Holy Grail – the single
largest technological advancement that would enable more content to
become accessible online (except for maybe the invention of the Web).
For the past year, we’ve joined with Matt to tackle this very hard
problem, and have finally made enough progress that we can begin to
report on it.
me start by explaining the problem. Ask a computer to read the page
below and it will stumble all over place.
(optical character recognition) technology could read some of the
typewritten text – but would be confused by the handwriting (and invent
typewritten letters that it thinks it sees inside handwritten text). To
make matters worse, this page has multiple typewritten font types,
including one that looks like cursive handwriting.
first process we had to develop was a way to perfectly separate
handwriting from typewritten text. If we could do this, the OCR could
read the typewritten text, and Matt’s code could attempt to read the
handwritten text. We call this process Handwriting Detection, and we
figured that if the system couldn’t detect the presence of handwriting,
how on Earth would we hope to decipher the marks into words? In the
example below, you can see how our system marks typewritten text in
green and handwritten text in red – with blue to denote what it
believes are graphics or images. It’s not 100% perfect, but hopefully
you agree that it’s headed in the right direction.
that we’ve detected where the handwriting is, we can start having some
fun. Let’s go back 130 years and change the ink from black to blue.
this is just handwriting detection (where we don’t understand what’s
written – we just know that handwriting is there).
Let’s talk recognition.
Historical handwriting recognition is one of the toughest technical
challenges to solve. First, penmanship is entirely unique to the
individual. Second, because it’s historical handwriting, it’s in
cursive. All the letters run together, adding another layer of
complexity. Third, the way we wrote cursive in the 1700′s is different
than the cursive we write now. There are even variations between
decades. Our mind has an incredible capability of seeing through
incomplete sets of data (a missing character stroke, poor handwriting,
an A that sort of looks like an O, etc). Our brains do all of this for
us and we don’t even notice it. When you think about how to describe
this to a computer, you begin to lose your mind! I believe some of the
greatest problems mankind can solve are those that someone would never
have started if they had known how hard the challenge was ahead of
time. Matt fooled himself just enough to start on the problem and now
he’s making real progress from which we are all going to benefit.
Here’s the exciting part: Our recognition technology is starting to
work. With limited vocabularies (potential answers), we’re achieving
90-95% accuracy. Sometimes, the technology is able to read things we’re
convinced are unreadable (but after getting the answer back from the
computer, you realize what was actually written). We grow closer to the
Holy Grail every day and can’t wait until we can use the technology to
bring more content online, free forever.
Matt and I will keep you updated on our progress over the coming weeks
and months, which should hopefully make for some exciting news in