StringDistances.jl


This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The score is always between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar. Its syntax is:

```
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
```

Edit Distances
- Jaro Distance Jaro()
- Levenshtein Distance Levenshtein()
- Damerau-Levenshtein Distance DamerauLevenshtein()
- RatcliffObershelp Distance RatcliffObershelp()
Q-gram distances compare the set of all substrings of length q in each string.
- QGram Distance QGram(q::Int)
- Cosine Distance Cosine(q::Int)
- Jaccard Distance Jaccard(q::Int)
- Overlap Distance Overlap(q::Int)
- Sorensen-Dice Distance SorensenDice(q::Int)
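
To make the q-gram idea concrete, here is a minimal sketch in plain Julia that builds the set of q-grams of each string and computes a Jaccard distance from them. The helper names `qgrams` and `jaccard_distance` are hypothetical and are not part of the package; the package's own implementation may differ in how it handles edge cases.

```julia
# Hypothetical helper (not a package function): the set of all
# substrings of length q in s.
function qgrams(s::AbstractString, q::Int)
    chars = collect(s)              # index by character, safe for Unicode
    n = length(chars) - q + 1       # number of q-gram start positions
    n <= 0 && return Set{String}()  # string shorter than q: no q-grams
    return Set(String(chars[i:i+q-1]) for i in 1:n)
end

# Hypothetical helper: Jaccard distance 1 - |A ∩ B| / |A ∪ B|
# over the two q-gram sets.
function jaccard_distance(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgrams(s1, q), qgrams(s2, q)
    isempty(a) && isempty(b) && return 0.0  # assumption: two empty sets count as equal
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

jaccard_distance("martha", "marhta", 2)
#> 0.75
```

The two strings share the bigrams "ma" and "ar" out of eight distinct bigrams in total, hence 1 - 2/8.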
The package includes distance "modifiers" that can be applied to any distance.
- Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by reordering words alphabetically.
- TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
- TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
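
As an illustration of the reordering step behind TokenSort, the following sketch sorts the whitespace-separated words of a string before any comparison takes place. `token_sort` is a hypothetical helper written for this example; how the package itself tokenizes strings is not specified here.

```julia
# Hypothetical helper (not a package function): split on whitespace,
# sort the words alphabetically, and join them back with single spaces.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# Two strings that differ only in word order normalize to the same string,
# so any base distance applied afterwards sees them as equal.
token_sort("mariners vs angels") == token_sort("angels vs mariners")
#> true
```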

Some examples:

```
using StringDistances

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("martha", "marhta", QGram(2))
compare("martha", "marhta", Winkler(QGram(2)))
compare("martha", "marhta", Levenshtein())
compare("martha", "marhta", Partial(Levenshtein()))
compare("martha", "marhta", Jaro())
compare("martha", "marhta", TokenSet(Jaro()))
compare("martha", "marhta", TokenMax(RatcliffObershelp()))
```

A good distance to match strings composed of multiple words (like addresses) is TokenMax(Levenshtein()) (see fuzzywuzzy).

Find

findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
```
findmax(s::AbstractString, itr, dist::StringDistance; min_score = 0.0)
```
findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (0.8 by default). Its syntax is:
```
findall(s::AbstractString, itr, dist::StringDistance; min_score = 0.8)
```

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
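
The behaviour described above can be sketched with Base functions alone: score every candidate, then take the maximum or filter by threshold. The `score` function below is a toy stand-in (Jaccard similarity on character sets) chosen only to keep the example self-contained. It is an assumption for illustration, not one of the package's distances, and the package's optimized code path is not shown.

```julia
# Toy similarity (an assumption for this example): share of distinct
# characters the two strings have in common.
score(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

candidates = ["New York", "Newark", "Boston"]
scores = [score("New York", c) for c in candidates]

# Best score together with its index, as Base.findmax does on a vector.
best_score, best_index = findmax(scores)
#> (1.0, 1)

# Indices of all candidates scoring above the threshold.
above = findall(>(0.8), scores)
#> [1]
```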

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 meaning completely similar. Some distances are bounded between 0 and 1, while others are unbounded.

```
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
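
To show how an unbounded distance can relate to a score between 0 and 1, here is a minimal plain-Julia sketch of the Levenshtein distance together with one possible normalization. Both `levenshtein` and `similarity` are hypothetical helpers, and dividing by the length of the longer string is an assumption made for this example rather than a statement about the package's exact formula.

```julia
# Hypothetical helper: classic dynamic-programming Levenshtein distance
# (minimum number of insertions, deletions and substitutions).
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1,   # deletion
                            curr[j] + 1,     # insertion
                            prev[j] + cost)  # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function similarity(s1::AbstractString, s2::AbstractString)
    len = max(length(s1), length(s2))
    return len == 0 ? 1.0 : 1.0 - levenshtein(s1, s2) / len
end

levenshtein("martha", "marhta")
#> 2
similarity("New York", "New York")
#> 1.0
```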

References

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy
