Bleg 1: String Distance

Mar 27 2014

String distance measurements are useful for cleaning up the sort of messy data from multiple sources.

There are a bunch of string distance algorithms, which usually rely on some form of calculations about the similarities of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalized changes to the most information-laden parts of the text more heavily than to the parts that are filler.

Real-world example: say you’re trying to match two lists of universities to each other. In one you have:

[500 university names…]

Rutgers the State University of New Jersey

and in the other you have:

[499 university names…]

Rutgers University

New Hampshire State University

By most string distance measures, ‘State University’ and ‘New’ will make the long version of Rutgers match New Hampshire State, not Rutgers. But in the context of those 500 other names, that’s not the correct match to make. The phrase “State University” actually conveys very little information (I’d guess fewer than 8 bits) , but that “R-u-t-g-e-r-s” are characters you should lose lots of points for changing. (Rough guess, 14 bits).

In practice, I often get around this by changing the string vocabulary by hand. (Change all occurrences of “University” to “Uni”, etc., ) I can imagine a few ways to solve this: eg., normalized compression distance starting from a file of everything, or calculating a standard string distance metric on a compressed version of names instead of the English version. But I feel like this must exist, and my Internet searches just won’t find it.