This Julia package computes various distances between AbstractStrings.
Installation
The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.
Compare
The function compare returns a similarity score between two strings. The score is always between 0 and 1, where 0 means the strings are completely different and 1 means they are identical. Its syntax is:
compare(::AbstractString, ::AbstractString, ::StringDistance)
- Edit Distances
  - Jaro Distance: Jaro()
  - Levenshtein Distance: Levenshtein()
  - Damerau-Levenshtein Distance: DamerauLevenshtein()
  - RatcliffObershelp Distance: RatcliffObershelp()
- Q-gram distances compare the set of all substrings of length q in each string.
  - QGram Distance: QGram(q)
  - Cosine Distance: Cosine(q)
  - Jaccard Distance: Jaccard(q)
  - Overlap Distance: Overlap(q)
  - Sorensen-Dice Distance: SorensenDice(q)
- The package includes distance "modifiers" that can be applied to any distance.
  - Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
  - Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
  - TokenSort adjusts for differences in word order by reordering words alphabetically.
  - TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
  - TokenMax combines scores using the base distance and the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
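To make the q-gram idea concrete, here is a minimal sketch in plain Julia that does not use the package. The helper names qgram_set and jaccard_sketch are hypothetical, and the set-based Jaccard formula is the textbook definition; the package's own implementation may differ in details such as how repeated q-grams are counted.

```julia
# Collect the set of all length-q substrings (q-grams) of a string.
# Hypothetical helper, not part of StringDistances.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)  # index by character rather than by byte
    return Set(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Textbook Jaccard distance between q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    # Assumption: two strings with no q-grams at all are treated as identical.
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "william" has 6 distinct bigrams and "williams" adds one more ("ms"),
# so the distance is 1 - 6/7.
jaccard_sketch("william", "williams", 2)
```

A small q makes the distance tolerant of local edits, while a larger q makes it stricter about longer shared runs of characters.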
Some examples:
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
A good distance for linking addresses and similar records (where word order does not matter) is TokenMax(Levenshtein()).
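The modifiers can be thought of as wrappers that preprocess the strings or post-process the score. The sketch below illustrates two of them in plain Julia. The names token_sort_sketch and winkler_sketch are hypothetical, and the Winkler formula with a scaling factor p = 0.1 and a prefix cap of 4 is the conventional Jaro-Winkler formulation, assumed here rather than taken from the package.

```julia
# TokenSort idea: put whitespace-separated words in alphabetical order
# so that word order no longer affects the comparison.
# Hypothetical helper, not part of StringDistances.
token_sort_sketch(s::AbstractString) = join(sort(split(s)), " ")

# Conventional Winkler adjustment: boost a similarity score in [0, 1]
# according to the length of the common prefix.
# p and max_prefix are assumed defaults, not values read from the package.
function winkler_sketch(score::Real, s1::AbstractString, s2::AbstractString;
                        p::Real = 0.1, max_prefix::Integer = 4)
    l = 0  # length of the common prefix, capped at max_prefix
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && l < max_prefix) || break
        l += 1
    end
    return score + l * p * (1 - score)
end

# Both inputs normalize to the same string once the words are sorted.
token_sort_sketch("new york yankees") == token_sort_sketch("yankees new york")

# "martha" and "marhta" share the prefix "mar", so a base score of 0.5
# is boosted by 3 * 0.1 * (1 - 0.5).
winkler_sketch(0.5, "martha", "marhta")
```

Because the boost is proportional to 1 - score, an adjusted score never exceeds 1 as long as max_prefix * p is at most 1.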
Find
- findmax returns the value and index of the element in iter with the highest similarity score with x. Its syntax is: findmax(x::AbstractString, iter::AbstractVector, dist::StringDistance)
- findall returns the indices of all elements in iter whose similarity score with x is higher than a minimum value (which defaults to 0.8). Its syntax is: findall(x::AbstractString, iter::AbstractVector, dist::StringDistance)
The functions findmax and findall are particularly optimized for the Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
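As a rough illustration of what these two searches compute, the following plain-Julia sketch applies a similarity function to every candidate. The names toy_score, best_match and matches_above are hypothetical stand-ins, the character-overlap score is only a placeholder for a real string distance, and none of the package's optimizations are reproduced.

```julia
# Placeholder similarity in [0, 1]: overlap between the character sets.
# It only stands in for a real string similarity in this sketch.
function toy_score(a::AbstractString, b::AbstractString)
    sa, sb = Set(a), Set(b)
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Score every candidate, then return (highest score, index of that candidate).
function best_match(score, x::AbstractString, candidates)
    scores = [score(x, c) for c in candidates]
    return findmax(scores)
end

# Indices of all candidates whose score exceeds min_score.
function matches_above(score, x::AbstractString, candidates; min_score = 0.8)
    return findall(c -> score(x, c) > min_score, candidates)
end

cities = ["New York", "Newark", "Boston"]
best_match(toy_score, "New York", cities)     # (1.0, 1)
matches_above(toy_score, "New York", cities)  # [1]
```

This naive version computes a full score for each candidate; a threshold such as min_score is the kind of information an optimized implementation can use to abandon poor candidates early.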
Evaluate
The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function evaluate returns the literal distance between two strings, where a value of 0 means identical. Some distances are bounded between 0 and 1, while others are unbounded.
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
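The relationship between a raw distance and a similarity score can be sketched in plain Julia. The function levenshtein_sketch below is a standard dynamic-programming edit distance, not the package's implementation, and the normalization by the length of the longer string in similarity_sketch is an assumption made for illustration.

```julia
# Standard dynamic-programming Levenshtein distance, keeping one row at a time.
# Hypothetical helper, not part of StringDistances.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))  # distances from the empty prefix of a
    for i in 1:length(a)
        cur = similar(prev)
        cur[1] = i               # distance from a[1:i] to the empty string
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,  # deletion
                             cur[j] + 1,       # insertion
                             prev[j] + cost)   # substitution
        end
        prev = cur
    end
    return prev[end]
end

# Assumed normalization: divide by the longer length and flip the scale,
# so that 1 means identical and 0 means completely different.
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    return 1 - levenshtein_sketch(s1, s2) / n
end

levenshtein_sketch("New York", "New York")  # 0
similarity_sketch("New York", "New York")   # 1.0
levenshtein_sketch("kitten", "sitting")     # 3
```

The raw distance grows with string length and has no upper bound, which is why a normalized score is easier to threshold across strings of different sizes.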
References
- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy