This Julia package computes various distances between strings.
Distances
Edit Distances
Q-Gram Distances
Q-gram distances compare the set of all substrings of length q
in each string.
- QGram Distance
- Cosine Distance
- Jaccard Distance
- Overlap Distance
- Sorensen-Dice Distance
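To make the q-gram idea concrete, here is a minimal sketch in plain Julia that does not use the package. The helper names `qgram_counts`, `qgram_distance` and `jaccard_distance` are hypothetical, and the definitions follow the usual textbook ones (count differences for QGram, set overlap for Jaccard); the package's own implementation may differ.

```julia
# Count the q-grams (runs of q consecutive characters) of a string.
function qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)                      # index characters, not bytes
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        gram = String(chars[i:i+q-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# QGram distance: sum of absolute differences between q-gram counts.
function qgram_distance(s1, s2, q)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    total = 0
    for gram in union(keys(c1), keys(c2))
        total += abs(get(c1, gram, 0) - get(c2, gram, 0))
    end
    return total
end

# Jaccard distance: 1 - |intersection| / |union| of the two q-gram sets.
# Assumes at least one string has a q-gram, otherwise the ratio is 0/0.
function jaccard_distance(s1, s2, q)
    g1 = Set(keys(qgram_counts(s1, q)))
    g2 = Set(keys(qgram_counts(s2, q)))
    return 1 - length(intersect(g1, g2)) / length(union(g1, g2))
end

qgram_distance("martha", "marhta", 2)    # 6
jaccard_distance("martha", "marhta", 2)  # 0.75
```

With q = 2, "martha" and "marhta" share only the bigrams "ma" and "ar", so six bigrams are unmatched and two of the eight distinct bigrams are common.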
Others
Syntax
The function compare returns a similarity score between two strings, based on their distance. The score is always between 0 and 1: a value of 0 means the strings are completely different, and a value of 1 means they are completely similar.
using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4
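The Hamming score above is consistent with normalizing the distance by its largest possible value. The sketch below reproduces it in plain Julia; `hamming_distance` and `hamming_similarity` are hypothetical helpers, and the normalization `1 - distance / length` is an assumption inferred from the output shown, not taken from the package source.

```julia
# Hamming distance: number of positions at which two equal-length strings differ.
function hamming_distance(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) || throw(ArgumentError("strings must have equal length"))
    return count(((a, b),) -> a != b, zip(s1, s2))
end

# Assumed normalization: divide by the largest possible distance (the length).
function hamming_similarity(s1, s2)
    n = length(s1)
    return n == 0 ? 1.0 : 1 - hamming_distance(s1, s2) / n
end

hamming_distance("martha", "marhta")    # 2
hamming_similarity("martha", "marhta")  # 0.6666666666666667
```

The same reading fits the QGram(2) result: the two strings have 5 bigrams each and 6 of those 10 are unmatched, giving 1 - 6/10 = 0.4.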
Modifiers
The package includes distance modifiers:
- Winkler boosts the similarity score of strings with common prefixes.
compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111
The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
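The Winkler scores above can be reproduced with the classic Jaro-Winkler formula, sketched here in plain Julia. `common_prefix_length` and `winkler_boost` are hypothetical names, and the constants (scaling factor 0.1, prefix capped at 4 characters) are the usual Jaro-Winkler defaults, assumed here rather than read from the package, which may also apply other conditions.

```julia
# Length of the common prefix of two strings, capped at `maxlength` characters.
function common_prefix_length(s1::AbstractString, s2::AbstractString, maxlength::Integer)
    n = 0
    for (a, b) in zip(s1, s2)
        (a == b && n < maxlength) || break
        n += 1
    end
    return n
end

# Winkler adjustment: move the score toward 1 in proportion to the shared prefix.
# scaling = 0.1 and maxlength = 4 are assumed defaults.
function winkler_boost(score, s1, s2; scaling = 0.1, maxlength = 4)
    l = common_prefix_length(s1, s2, maxlength)
    return score + l * scaling * (1 - score)
end

winkler_boost(0.9444444444444445, "martha", "marhta")     # ≈ 0.9611
winkler_boost(0.9230769230769231, "william", "williams")  # ≈ 0.9538
```

Because the boost only needs a base score and the two strings, it can wrap any similarity, which is how the adjustment generalizes beyond Jaro.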
- The Python library fuzzywuzzy defines a few modifiers for the RatcliffObershelp similarity score. This package replicates them and extends them to any string distance:
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
compare(Levenshtein(), "New York Yankees", "Yankees")
#> 0.4375
compare(Partial(Levenshtein()), "New York Yankees", "Yankees")
#> 1.0
- TokenSort adjusts for differences in word order by reordering words alphabetically.
compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
- TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
compare(Jaro(), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.559904
compare(TokenSet(Jaro()), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.944444
- TokenMax combines scores using the base distance and the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
compare(TokenMax(RatcliffObershelp()), "mariners vs angels", "los angeles angels at seattle mariners")
#> 0.855
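The Partial and TokenSort modifiers described above can be illustrated without the package. The sketch below is plain Julia with hypothetical helpers (`levenshtein`, `similarity`, `partial_similarity`, `token_sort`). It assumes the similarity is `1 - distance / length of the longer string`, and it only tries substrings with the same length as the shorter string; the package may select candidate substrings differently.

```julia
# Levenshtein distance via the standard two-row dynamic program.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function similarity(s1, s2)
    n = max(length(s1), length(s2))
    return n == 0 ? 1.0 : 1 - levenshtein(s1, s2) / n
end

# Partial: best score of the shorter string against every substring of the
# longer string that has the shorter string's length.
function partial_similarity(s1, s2)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    cl = collect(long)
    n = length(short)
    return maximum(similarity(short, String(cl[i:i+n-1])) for i in 1:(length(cl) - n + 1))
end

# TokenSort: split on whitespace, sort the words, and rejoin before comparing.
token_sort(s) = join(sort(split(s)), " ")

similarity("New York Yankees", "Yankees")          # 0.4375
partial_similarity("New York Yankees", "Yankees")  # 1.0
token_sort("mariners vs angels") == token_sort("angels vs mariners")  # true
```

Both modifiers transform or restrict the inputs and then defer to the base score, which is why they can be layered over any distance.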
References
- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy blog post