The package is registered in the `General` registry and so can be installed at the REPL with `] add StringDistances`.
Distances are defined for `AbstractString`s, and for any iterator that defines `length()` (e.g. `graphemes`, `AbstractVector`, ...).
The available distances are:
Hamming()
Jaro()
JaroWinkler()
Levenshtein()
DamerauLevenshtein()
RatcliffObershelp()
Q-gram distances, which compare the substrings of length `q` in each string:
Qgram(q::Int)
Cosine(q::Int)
Jaccard(q::Int)
Overlap(q::Int)
SorensenDice(q::Int)
MorisitaOverlap(q::Int)
NMD(q::Int)
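To make the idea of a q-gram distance concrete, here is a minimal sketch in plain Julia. It does not use the package: `qgrams_sketch` and `jaccard_sketch` are hypothetical helper names, and the Jaccard formula used here (one minus the ratio of shared q-grams to all distinct q-grams) is the textbook set definition, assumed for illustration rather than taken from the package source.

```julia
# Collect the distinct q-grams (runs of q consecutive characters) of a string.
# Hypothetical helper, not part of StringDistances.
function qgrams_sketch(s::AbstractString, q::Int)
    chars = collect(s)  # index by character rather than by byte
    return Set{String}(String(chars[i:i + q - 1]) for i in 1:(length(chars) - q + 1))
end

# Textbook Jaccard distance on q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
# Returns NaN when neither string has any q-gram.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgrams_sketch(s1, q), qgrams_sketch(s2, q)
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "martha" has the 3-grams mar, art, rth, tha; "marhta" has mar, arh, rht, hta.
# Only "mar" is shared, out of 7 distinct 3-grams.
jaccard_sketch("martha", "marhta", 3)  # 1 - 1/7
```

The other q-gram distances in the list differ in how they compare the two collections of q-grams, not in how the q-grams are extracted.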
The package also defines Distance "modifiers" that can be applied to any distance.
`TokenMax` combines the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in fuzzywuzzy.

You can always compute a given distance between two strings using the following syntax:
evaluate(dist, s1, s2)
dist(s1, s2)
For instance, with the `Levenshtein` distance,
evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
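For readers who want to see where the number comes from, the following is a minimal sketch of the classic two-row dynamic programme for the Levenshtein distance, written in plain Julia. `levenshtein_sketch` is a hypothetical helper for illustration only; it is not the package's implementation, which may be organised quite differently.

```julia
# Illustrative Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions turning s1 into s2.
# Hypothetical helper, not part of StringDistances.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)   # work on characters, not bytes
    prev = collect(0:length(b))       # row for the empty prefix of s1
    curr = similar(prev)
    for i in 1:length(a)
        curr[1] = i                   # delete the first i characters of s1
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev, curr = curr, prev       # the row just filled becomes the previous row
    end
    return prev[end]
end

levenshtein_sketch("martha", "marhta")   # 2: the swapped "th" costs two substitutions
levenshtein_sketch("kitten", "sitting")  # 3
```

Because plain Levenshtein has no transposition operation, the swapped letters in "martha" and "marhta" count as two edits.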
`pairwise` returns the matrix of distances between the elements of two `AbstractVector`s of `AbstractString`s:
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
It is particularly fast for q-gram distances (each element is processed once).
The function `compare` returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a `Float64`. A value of 0.0 means completely different and a value of 1.0 means completely similar.
Levenshtein()("martha", "martha")
#> 0.0
compare("martha", "martha", Levenshtein())
#> 1.0
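The relation "similarity = 1 minus normalized distance" can be sketched in plain Julia as follows. The choice of distance (Hamming) and of normalizer (the string length) are assumptions made to keep the example short; the package may normalize each distance differently. `hamming_sketch` and `similarity_sketch` are hypothetical names.

```julia
# Hamming distance: number of positions at which two equal-length strings differ.
# Hypothetical helper, not part of StringDistances.
function hamming_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    length(a) == length(b) || throw(ArgumentError("strings must have equal length"))
    return count(i -> a[i] != b[i], eachindex(a))
end

# Similarity score: 1 minus the distance divided by its largest possible value
# (assumed here to be the string length). The result is always a Float64.
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    n = length(s1)
    n == 0 && return 1.0              # two empty strings are treated as identical
    return 1.0 - hamming_sketch(s1, s2) / n
end

similarity_sketch("martha", "martha")  # 1.0
similarity_sketch("martha", "marhta")  # 1 - 2/6
similarity_sketch("abc", "xyz")        # 0.0
```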
`findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
findnearest(s, itr, dist::StringDistance)
`findall` returns the indices of all elements in `itr` whose similarity score with `s` is higher than a minimum value (which defaults to 0.8). Its syntax is:
findall(s, itr, dist::StringDistance; min_score = 0.8)
The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
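As a rough illustration of what these two functions compute, the sketch below implements the same idea naively in plain Julia, scoring every candidate in turn. It contains none of the optimizations mentioned above. All names (`toy_similarity`, `findnearest_sketch`, `findall_sketch`) are hypothetical, the similarity function is a stand-in chosen for brevity, and treating the threshold as inclusive (`>=`) is an assumption.

```julia
# Stand-in similarity for this sketch: share of positions holding the same
# character, relative to the longer string. Not a distance from the package.
function toy_similarity(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    n = max(length(a), length(b))
    n == 0 && return 1.0
    return count(i -> a[i] == b[i], 1:min(length(a), length(b))) / n
end

# Value and index of the candidate with the highest similarity to s.
function findnearest_sketch(s, itr, sim)
    items = collect(itr)
    isempty(items) && return (nothing, nothing)
    _, idx = findmax([sim(s, x) for x in items])
    return (items[idx], idx)
end

# Indices of all candidates whose similarity to s reaches min_score.
findall_sketch(s, itr, sim; min_score = 0.8) =
    findall(x -> sim(s, x) >= min_score, collect(itr))

candidates = ["marhta", "martha", "kitten"]
findnearest_sketch("martha", candidates, toy_similarity)             # ("martha", 2)
findall_sketch("martha", candidates, toy_similarity)                 # [2]
findall_sketch("martha", candidates, toy_similarity; min_score = 0.5) # [1, 2]
```

Passing the similarity as a function keeps the sketch independent of any particular distance; in the package that role is played by the `dist` argument.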