This Julia package computes various distances between AbstractStrings.
Installation
The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.
Compare
The function compare returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar. Its syntax is:
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
- Edit Distances
  - Jaro Distance: Jaro()
  - Levenshtein Distance: Levenshtein()
  - Damerau-Levenshtein Distance: DamerauLevenshtein()
  - RatcliffObershelp Distance: RatcliffObershelp()
- Q-gram distances compare the set of all substrings of length q in each string.
  - QGram Distance: QGram(q::Int)
  - Cosine Distance: Cosine(q::Int)
  - Jaccard Distance: Jaccard(q::Int)
  - Overlap Distance: Overlap(q::Int)
  - Sorensen-Dice Distance: SorensenDice(q::Int)
- The package includes distance "modifiers" that can be applied to any distance.
  - Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
  - Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
  - TokenSort adjusts for differences in word order by reordering words alphabetically.
  - TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
  - TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
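As a rough illustration of the q-gram family listed above, the sketch below builds the set of q-grams of each string and derives a Jaccard distance from the two sets. The helper names qgram_set and jaccard_distance are hypothetical, and this is not the package's implementation, which may for instance treat repeated q-grams differently.

```julia
# Hypothetical helper: the set of all substrings of length q in s.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)                      # index by character, not by byte
    grams = Set{String}()
    for i in 1:(length(chars) - q + 1)
        push!(grams, String(chars[i:(i + q - 1)]))
    end
    return grams
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_distance(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    # Assumption: two strings with no q-grams at all are treated as identical.
    isempty(a) && isempty(b) && return 0.0
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

# "martha" and "marhta" share the bigrams "ma" and "ar" out of 8 distinct ones.
println(jaccard_distance("martha", "marhta", 2))
```

With sets, only the presence of a q-gram matters, so this distance ignores where in the string a q-gram occurs.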
Some examples:
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("martha", "marhta", QGram(2))
compare("martha", "marhta", Winkler(QGram(2)))
compare("martha", "marhta", Levenshtein())
compare("martha", "marhta", Partial(Levenshtein()))
compare("martha", "marhta", Jaro())
compare("martha", "marhta", TokenSet(Jaro()))
compare("martha", "marhta", TokenMax(RatcliffObershelp()))
A good distance to match strings composed of multiple words (like addresses) is TokenMax(Levenshtein()) (see fuzzywuzzy).
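To see why token-based modifiers help with multi-word strings, here is a minimal sketch of the reordering idea behind TokenSort: words are sorted alphabetically before any base distance is applied. The helper token_sort is hypothetical and assumes whitespace-separated words; the package's own tokenization may differ.

```julia
# Hypothetical helper: split on whitespace, sort the words, join them again.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

a = "Main Street 12 Springfield"
b = "Springfield 12 Main Street"

println(token_sort(a))
println(token_sort(b))

# After reordering, the two addresses are the same string, so any base
# distance applied afterwards would treat them as identical.
println(token_sort(a) == token_sort(b))
```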
Find
- findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
  findmax(s::AbstractString, itr, dist::StringDistance; min_score = 0.0)
- findall returns the indices of all elements in itr with a similarity score with s higher than a minimum value (defaults to 0.8). Its syntax is:
  findall(s::AbstractString, itr, dist::StringDistance; min_score = 0.8)
The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
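The behaviour described for findmax and findall can be sketched without the package as a plain scan over the candidates. Everything below is an illustrative assumption: the names best_match, all_matches and toy_similarity are hypothetical, the threshold is taken to be inclusive, the (nothing, nothing) result for "no match" is a choice made for this sketch, and none of the optimizations mentioned above are attempted.

```julia
# Toy similarity in [0, 1]: share of positions holding the same character.
function toy_similarity(a::AbstractString, b::AbstractString)
    ca, cb = collect(a), collect(b)
    n = max(length(ca), length(cb))
    n == 0 && return 1.0
    matches = count(i -> ca[i] == cb[i], 1:min(length(ca), length(cb)))
    return matches / n
end

# Score and index of the best candidate, ignoring scores below min_score.
function best_match(s, itr, sim; min_score = 0.0)
    best_score, best_index = -Inf, nothing
    for (i, x) in pairs(itr)
        score = sim(s, x)
        if score >= min_score && score > best_score
            best_score, best_index = score, i
        end
    end
    return best_index === nothing ? (nothing, nothing) : (best_score, best_index)
end

# Indices of every candidate whose score reaches min_score.
all_matches(s, itr, sim; min_score = 0.8) =
    [i for (i, x) in pairs(itr) if sim(s, x) >= min_score]

candidates = ["New York", "Newark", "Yorkshire"]
println(best_match("New Yrok", candidates, toy_similarity))
println(all_matches("New Yrok", candidates, toy_similarity; min_score = 0.3))
```

Passing the similarity as a function keeps the scan independent of any particular distance; a real distance could be swapped in for toy_similarity.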
Evaluate
The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1, while others are unbounded.
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
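One way to picture the relation between the two functions is to compute an unbounded distance and then rescale it into a similarity. The sketch below does this for a plain Levenshtein distance, assuming the score is obtained by dividing by the length of the longer string; the helper names lev_distance and lev_similarity are hypothetical and this is not the package's code.

```julia
# Dynamic-programming Levenshtein distance, keeping only one previous row.
function lev_distance(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function lev_similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    return 1.0 - lev_distance(s1, s2) / n
end

println(lev_distance("New York", "New York"))    # distance: 0 means identical
println(lev_similarity("New York", "New York"))  # similarity: 1.0 means identical
println(lev_distance("martha", "marhta"))
```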
References
- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy