Installation
The package is registered in the General registry and so can be installed at the REPL with `] add StringDistances`.
Supported Distances
Distances are defined for `AbstractString`s, and for any iterator that defines `length()` (e.g. `graphemes`, `AbstractVector`...).
The available distances are:
- Edit Distances
  - Hamming Distance `Hamming()`
  - Jaro and Jaro-Winkler Distance `Jaro()` `JaroWinkler()`
  - Levenshtein Distance `Levenshtein()`
  - Damerau-Levenshtein Distance `DamerauLevenshtein()`
  - RatcliffObershelp Distance `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
  - QGram Distance `QGram(q::Int)`
  - Cosine Distance `Cosine(q::Int)`
  - Jaccard Distance `Jaccard(q::Int)`
  - Overlap Distance `Overlap(q::Int)`
  - Sorensen-Dice Distance `SorensenDice(q::Int)`
  - MorisitaOverlap Distance `MorisitaOverlap(q::Int)`
  - Normalized Multiset Distance `NMD(q::Int)`
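To make the q-gram idea concrete, here is a minimal sketch in plain Julia that does not use the package. The helper names `qgrams` and `jaccard_sketch` are hypothetical and exist only for illustration; the package's own implementation may differ, for example in how it handles strings shorter than `q`.

```julia
# Illustrative only: these helpers are not part of StringDistances.

# Return every substring of q consecutive characters in s.
function qgrams(s::AbstractString, q::Int)
    chars = collect(s)  # work on characters rather than bytes
    return [String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)]
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
# Assumes at least one string has q or more characters, so the union is non-empty.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = Set(qgrams(s1, q)), Set(qgrams(s2, q))
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

qgrams("martha", 3)                    # ["mar", "art", "rth", "tha"]
jaccard_sketch("martha", "marhta", 3)  # 1 - 1/7, since only "mar" is shared
```

The two strings share a single 3-gram out of seven distinct ones, which is why a small transposition produces a large set-based distance at `q = 3`.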
Basic Use
evaluate
To compute a given distance between two strings, use either of the following equivalent forms:
evaluate(dist, s1, s2)
dist(s1, s2)
For instance, with the `Levenshtein` distance:
evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
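As a short sketch of what these calls return, assuming the package is installed and exports `evaluate` as described in this README, the two forms can be compared directly. The variable names `d1` and `d2` are only for illustration.

```julia
using StringDistances

# Both call forms compute the same distance.
d1 = evaluate(Levenshtein(), "martha", "marhta")
d2 = Levenshtein()("martha", "marhta")
d1 == d2  # true

# "th" -> "ht" costs two substitutions under Levenshtein, so d1 is 2.
# Damerau-Levenshtein counts an adjacent transposition as a single edit.
DamerauLevenshtein()("martha", "marhta")  # 1
```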
pairwise
`pairwise` returns the matrix of distances between two `AbstractVector`s of `AbstractString`s:
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
The function `pairwise` is particularly optimized for QGram distances (each element is processed only once).
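The sketch below shows how the result can be read, under the assumption that entry `[i, j]` of the matrix is the distance between the `i`-th string of the first vector and the `j`-th string of the second. The names `xs`, `ys` and `D` are only for illustration.

```julia
using StringDistances

xs = ["martha", "kitten"]
ys = ["marhta", "sitting"]

D = pairwise(Jaccard(3), xs, ys)

size(D)  # (2, 2): one row per element of xs, one column per element of ys

# Assumed layout: D[i, j] matches the distance computed on a single pair.
D[1, 2] ≈ Jaccard(3)(xs[1], ys[2])
```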
similarity score
- The function `compare` returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a `Float64`. A value of 0.0 means completely different and a value of 1.0 means completely similar. For example, `Levenshtein()("martha", "martha")` returns `0` (the distance), while `compare("martha", "martha", Levenshtein())` returns `1.0` (the similarity).
- `findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is: `findnearest(s, itr, dist::StringDistance)`
- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (0.8 by default). Its syntax is: `findall(s, itr, dist::StringDistance; min_score = 0.8)`
The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (these distances stop early if the distance is higher than a certain threshold).
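The following sketch puts `compare`, `findnearest` and `findall` side by side, using the signatures given above. The candidate list and the variable names are made up for illustration.

```julia
using StringDistances

candidates = ["NewYork", "Newark", "San Francisco"]

# Similarity between the query and a single candidate, as a Float64 in [0, 1].
score = compare("New York", "NewYork", Levenshtein())

# Value and index of the most similar candidate.
best = findnearest("New York", candidates, Levenshtein())  # ("NewYork", 1)

# Indices of every candidate scoring above the default min_score of 0.8.
close_matches = findall("New York", candidates, Levenshtein())  # [1]

# A lower threshold can be passed through the min_score keyword.
loose_matches = findall("New York", candidates, Levenshtein(); min_score = 0.5)
```

Only "NewYork" is a single edit away from the query, so it is the only candidate that clears the default threshold.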
References
- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy