StringDistances.jl

This Julia package computes various distances between strings.

Distances

Edit Distances

Q-Grams Distances

Q-gram distances compare the set of all substrings of length q in each string.
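As a rough illustration of the idea, the sketch below counts the q-grams of each string and sums the absolute differences between the two count tables. The helper names `qgram_counts`, `qgram_distance` and `qgram_similarity` are hypothetical and not part of the package. Counting occurrences (rather than using plain sets) and normalizing by the total number of q-grams are assumptions, chosen because they reproduce the `QGram(2)` scores shown in this README.

```julia
# Count every substring of length q (a "q-gram") in s.
function qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)                       # index by character, not by byte
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        gram = join(chars[i:i+q-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Sum of absolute differences between the two q-gram count tables.
function qgram_distance(s1::AbstractString, s2::AbstractString, q::Integer)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    total = 0
    for gram in union(keys(c1), keys(c2))
        total += abs(get(c1, gram, 0) - get(c2, gram, 0))
    end
    return total
end

# Assumed normalization: divide by the total number of q-grams in both strings.
function qgram_similarity(s1::AbstractString, s2::AbstractString, q::Integer)
    n = max(length(s1) - q + 1, 0) + max(length(s2) - q + 1, 0)
    return n == 0 ? 1.0 : 1 - qgram_distance(s1, s2, q) / n
end

println(qgram_distance("martha", "marhta", 2))    # 6
println(qgram_similarity("martha", "marhta", 2))  # approximately 0.4
```

Here "martha" and "marhta" share only the bigrams "ma" and "ar", so six of the ten bigrams are unmatched.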

Others

Syntax

The function compare returns a similarity score between two strings, based on their distance. The score is always between 0 and 1: a value of 0 means the strings are completely different, and a value of 1 means they are identical.

using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4

To return the literal distance between two strings, use evaluate.
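The relationship between a raw distance and a similarity score can be sketched with the Hamming distance. The functions `hamming_distance` and `hamming_similarity` below are hypothetical stand-ins, not the package's implementation. The normalization (one minus the distance divided by the string length) is an assumption that happens to match the Hamming score shown above.

```julia
# Hamming distance: number of positions at which the characters differ.
# This sketch only handles strings with the same number of characters.
function hamming_distance(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) ||
        throw(ArgumentError("strings must have the same length"))
    return count(a != b for (a, b) in zip(s1, s2))
end

# Assumed normalization: 1 - distance / length, so identical strings score 1.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    n = length(s1)
    return n == 0 ? 1.0 : 1 - hamming_distance(s1, s2) / n
end

println(hamming_distance("martha", "marhta"))    # 2
println(hamming_similarity("martha", "marhta"))  # approximately 0.6667
```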

Modifiers

The package includes distance modifiers:

Winkler boosts the similarity score of strings with common prefixes.

compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111

The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.

compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
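The usual form of the Winkler adjustment moves a score toward 1 in proportion to the length of the shared prefix. The sketch below assumes a scaling factor of 0.1 and a prefix cap of 4 characters, which are the conventional Jaro-Winkler values and reproduce the scores above. The function `winkler_boost` and its keyword names are hypothetical, and the package may apply extra conditions that are not shown here.

```julia
# Apply a Winkler-style prefix boost to an existing similarity score.
# `scaling` and `max_prefix` are assumed defaults, not taken from the package.
function winkler_boost(score::Real, s1::AbstractString, s2::AbstractString;
                       scaling::Real = 0.1, max_prefix::Integer = 4)
    prefix = 0
    for (a, b) in zip(s1, s2)
        # Stop at the first mismatch or once the prefix cap is reached.
        (a == b && prefix < max_prefix) || break
        prefix += 1
    end
    return score + prefix * scaling * (1 - score)
end

# Jaro score from the example above; the shared prefix is "mar" (3 characters).
println(winkler_boost(0.9444444444444445, "martha", "marhta"))     # approximately 0.9611

# QGram(2) score from the example above; the prefix is capped at 4 characters.
println(winkler_boost(0.9230769230769231, "william", "williams"))  # approximately 0.9538
```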

The Python library fuzzywuzzy defines a few modifiers for the RatcliffObershelp similarity score. This package replicates them and extends them to any string distance:

Partial returns the maximal similarity score between the shorter string and substrings of the longer string.

compare(Levenshtein(), "New York Yankees", "Yankees")
#> 0.4375
compare(Partial(Levenshtein()), "New York Yankees", "Yankees")
#> 1.0
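A brute-force way to picture Partial is to slide a window the length of the shorter string across the longer one and keep the best score. The sketch below does this with a plain Levenshtein similarity. All function names are hypothetical, and the package (like fuzzywuzzy) may select candidate substrings differently, so treat this as an illustration under those assumptions.

```julia
# Levenshtein distance via the classic two-row dynamic program.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function lev_similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    return n == 0 ? 1.0 : 1 - levenshtein(s1, s2) / n
end

# Best similarity between the shorter string and any equally long
# window of the longer string.
function partial_similarity(s1::AbstractString, s2::AbstractString)
    shorter, longer = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    chars = collect(longer)
    n = length(shorter)
    best = 0.0
    for i in 1:(length(chars) - n + 1)
        best = max(best, lev_similarity(shorter, join(chars[i:i+n-1])))
    end
    return best
end

println(lev_similarity("New York Yankees", "Yankees"))      # 0.4375
println(partial_similarity("New York Yankees", "Yankees"))  # 1.0
```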

TokenSort adjusts for differences in word order by reordering words alphabetically.

compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
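The preprocessing step behind TokenSort can be sketched in one line: split on whitespace, sort the words, and join them again. The helper `token_sort` is hypothetical and only splits on whitespace. Once both strings are normalized this way, any base distance sees identical inputs in the example above.

```julia
# Normalize word order: split on whitespace, sort the words, rejoin.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

a = token_sort("mariners vs angels")
b = token_sort("angels vs mariners")

println(a)        # angels mariners vs
println(a == b)   # true
```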

TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.

compare(Jaro(), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.559904
compare(TokenSet(Jaro()), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.944444
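The comparison described for TokenSet can be sketched along the lines of fuzzywuzzy's token-set approach: build one string from the shared words, two more by appending each string's remaining words, and take the best score among the three pairings. The names `token_set_strings` and `token_set_score` are hypothetical, the sketch only splits on whitespace, and whether the package follows exactly these steps is an assumption.

```julia
# Build the three strings compared by a fuzzywuzzy-style token-set ratio.
function token_set_strings(s1::AbstractString, s2::AbstractString)
    t1, t2 = Set(split(s1)), Set(split(s2))
    common = sort(collect(intersect(t1, t2)))   # words present in both strings
    only1  = sort(collect(setdiff(t1, t2)))     # words only in the first string
    only2  = sort(collect(setdiff(t2, t1)))     # words only in the second string
    shared = join(common, " ")
    with1  = join(vcat(common, only1), " ")
    with2  = join(vcat(common, only2), " ")
    return shared, with1, with2
end

# Best score among the three pairings, for any similarity function `sim`.
function token_set_score(sim, s1::AbstractString, s2::AbstractString)
    shared, with1, with2 = token_set_strings(s1, s2)
    return max(sim(shared, with1), sim(shared, with2), sim(with1, with2))
end

shared, with1, with2 = token_set_strings("mariners vs angels",
                                         "los angeles angels at seattle mariners")
println(shared)  # angels mariners
println(with1)   # angels mariners vs
println(with2)   # angels mariners angeles at los seattle
```

In this example the closest pairing is "angels mariners" against "angels mariners vs", which differ only by a three-character suffix.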

TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
```
compare(TokenMax(RatcliffObershelp()), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.855
```

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy blog post](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
