Installation
The package is registered in the General registry and so can be installed at the REPL with `] add StringDistances`.
Supported Distances
The available distances are:
- Edit Distances
  - Jaro Distance `Jaro()`
  - Levenshtein Distance `Levenshtein()`
  - Damerau-Levenshtein Distance `DamerauLevenshtein()`
  - RatcliffObershelp Distance `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
  - QGram Distance `QGram(q::Int)`
  - Cosine Distance `Cosine(q::Int)`
  - Jaccard Distance `Jaccard(q::Int)`
  - Overlap Distance `Overlap(q::Int)`
  - Sorensen-Dice Distance `SorensenDice(q::Int)`
- Distance "modifiers" that can be applied to any distance:
  - Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
  - Partial returns the minimum distance between the shorter string and substrings of the longer string.
  - TokenSort adjusts for differences in word order by reordering words alphabetically.
  - TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
  - TokenMax combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance for matching strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in fuzzywuzzy.
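To make the q-gram idea concrete, here is a minimal self-contained sketch of a Jaccard-style distance over q-gram sets. It uses only Base Julia and is not the package's implementation. The names `qgram_set` and `jaccard_sketch` are hypothetical, and the package may count or weight q-grams differently.

```julia
# Hypothetical helper: the set of all length-q substrings of s.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)                      # index by character, not by byte
    n = length(chars)
    n < q && return Set{String}()           # string too short to contain any q-gram
    return Set(String(chars[i:i+q-1]) for i in 1:(n - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0  # assumption: no q-grams on either side means distance 0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "martha" and "marhta" share 2 of their 8 distinct bigrams ("ma" and "ar").
jaccard_sketch("martha", "marhta", 2)
#> 0.75
```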
Basic Use
Evaluate
You can always compute a certain distance between two strings using the following syntax:
evaluate(dist, s1, s2)
dist(s1, s2)
For instance, with the `Levenshtein` distance,
evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
You can also compute a distance between two iterators:
evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5])
#> 2
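For readers curious about what such a call computes, the following sketch shows the textbook two-row dynamic program for the Levenshtein distance. It is an illustration in plain Julia, not the package's (more optimized) code. `levenshtein_sketch` is a hypothetical name, and it accepts strings as well as other iterables such as vectors.

```julia
# Textbook Levenshtein distance using two rows of the dynamic-programming table.
function levenshtein_sketch(a, b)
    x, y = collect(a), collect(b)          # strings become vectors of characters
    prev = collect(0:length(y))            # cost of turning an empty prefix of x into each prefix of y
    for i in 1:length(x)
        curr = similar(prev)
        curr[1] = i                        # delete the first i elements of x
        for j in 1:length(y)
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

levenshtein_sketch("martha", "marhta")
#> 2
levenshtein_sketch([1, 5, 6], [1, 6, 5])
#> 2
```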
Compare
The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are identical.
evaluate(Levenshtein(), "martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
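As a rough illustration of "1 minus the normalized distance", the sketch below assumes that an edit distance is normalized by the length of the longer input. That normalization is an assumption made for the example. The package may normalize each distance differently, and `similarity_sketch` is a hypothetical helper.

```julia
# Assumed normalization: divide the edit distance d by the length of the longer input.
function similarity_sketch(d::Integer, len1::Integer, len2::Integer)
    longest = max(len1, len2)
    longest == 0 && return 1.0             # assumption: two empty inputs count as identical
    return 1.0 - d / longest
end

similarity_sketch(0, 6, 6)   # "martha" vs "martha", distance 0
#> 1.0
similarity_sketch(2, 6, 6)   # "martha" vs "marhta", distance 2, roughly 0.67
```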
Find
- `findmax` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is: `findmax(s, itr, dist::StringDistance; min_score = 0.0)`
- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (default 0.8). Its syntax is: `findall(s, itr, dist::StringDistance; min_score = 0.8)`
The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
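The behaviour described for the two search functions can be sketched in plain Julia as follows. This is not the package's optimized code. `findmax_sketch`, `findall_sketch` and `charset_score` are hypothetical names, the toy score is simply a Jaccard similarity over character sets, and returning `(nothing, nothing)` when no candidate reaches `min_score` is an assumption.

```julia
# `score` is any function returning a similarity in [0, 1] for two inputs.
function findmax_sketch(score, s, itr; min_score = 0.0)
    best_item, best_index, best_score = nothing, nothing, -Inf
    for (i, x) in enumerate(itr)
        sc = score(s, x)
        if sc > best_score
            best_item, best_index, best_score = x, i, sc
        end
    end
    # Assumption: report no match when even the best candidate is below min_score.
    return best_score >= min_score ? (best_item, best_index) : (nothing, nothing)
end

# Indices of all candidates whose score is strictly higher than min_score.
function findall_sketch(score, s, itr; min_score = 0.8)
    return [i for (i, x) in enumerate(itr) if score(s, x) > min_score]
end

# Toy similarity for the demo: shared characters over all distinct characters.
charset_score(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

candidates = ["NewYork", "Newark", "San Francisco"]
findmax_sketch(charset_score, "New York", candidates)
#> ("NewYork", 1)
findall_sketch(charset_score, "New York", candidates)
#> [1]
```

Passing the scoring function as an argument keeps the sketch independent of any particular distance. With the package itself, that role is played by the `dist` argument.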
References
- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy