String Distance

One of my first projects at Leaf has been trying to match some data read from an OCR solution against what is in the database. As sophisticated as OCR algorithms have become though it’s still not reliable enough to guarantee 100% accurate results every time due to the number of variations in type faces, artifacts introduced through scanning or faxing the document, or any number of other factors.

Most of the documents I’ve been working with have been pretty clean and I’ve been able to get an exact match automatically. One of my samples though, has some security features that intentionally obfuscate some of the information I care about. This naturally makes getting an exact match difficult. Amazingly though, the OCR result was about 80% accurate so there was still some hope.

One of my coworkers suggested that I look at some of the string distance algorithms to see if any of them could help us get closer to an exact match. He pointed me at the Levenshtein algorithm so I took a look at that along with the Hamming and Damerau-Levenshtein algorithms.

For the uninitiated (like myself a week ago), these algorithms provide a way to determine the distance between two strings. The distance is essentially a measurement of string similarity. In other words, they calculate how many steps are required to transform one string into another.

I want to look briefly at each of these and show some examples. Note that each of these algorithms are case sensitive but modifying them to ignore case is trivial.

(more…)