d47e43a757  

.github/workflows  
benchmark  
src  
test  
.gitignore  
LICENSE.md  
Project.toml  
README.md 
README.md
Installation
The package is registered in the General
registry and so can be installed at the REPL with ] add StringDistances
.
Supported Distances
String distances act over any pair of iterators that define length
(e.g. AbstractStrings
, GraphemeIterators
, or AbstractVectors
)
The available distances are:
 Edit Distances
 Hamming Distance
Hamming() <: SemiMetric
 Jaro and JaroWinkler Distance
Jaro()
JaroWinkler() <: SemiMetric
 Levenshtein Distance
Levenshtein() <: Metric
 Optimal String Alignment Distance (a.k.a. restricted DamerauLevenshtein)
OptimalStringAlignment() <: SemiMetric
 DamerauLevenshtein Distance
DamerauLevenshtein() <: Metric
 RatcliffObershelp Distance
RatcliffObershelp() <: SemiMetric
 Hamming Distance
 Qgram distances (which compare the set of all substrings of length
q
in each string) QGram Distance
QGram(q::Int) <: SemiMetric
 Cosine Distance
Cosine(q::Int) <: SemiMetric
 Jaccard Distance
Jaccard(q::Int) <: SemiMetric
 Overlap Distance
Overlap(q::Int) <: SemiMetric
 SorensenDice Distance
SorensenDice(q::Int) <: SemiMetric
 MorisitaOverlap Distance
MorisitaOverlap(q::Int) <: SemiMetric
 Normalized Multiset Distance
NMD(q::Int) <: SemiMetric
 QGram Distance
Syntax
Following the Distances.jl
package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric
or StringMetric <: Metric
.
Computing the distance between two strings (or iterators)
You can always compute a certain distance between two strings using the following syntax
r = evaluate(dist, x, y)
r = dist(x, y)
Here, dist
is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein
. You can compute the Levenshtein distance between x
and y
as
r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)
The function compare
returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64
. A value of 0.0 means completely different and a value of 1.0 means completely similar.
Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
Computing the distance between two AbstractVectors of strings (or iterators)
Consider X
and Y
two AbstractVectors
of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j])
, as follows:
pairwise(dist, X, Y)
For instance,
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
pairwise
is optimized in various ways (e.g., for the case of QGramdistances, dictionary of qgrams are precomputed)
Find closest string
The package also adds convenience functions to find elements in a iterator of strings closest to a given string

findnearest
returns the value and index of the element initr
with the highest similarity score withs
. Its syntax is:findnearest(s, itr, dist)

findall
returns the indices of all elements initr
with a similarity score withs
higher than a minimum score. Its syntax is:findall(s, itr, dist; min_score = 0.8)
The functions findnearest
and findall
are particularly optimized for the Levenshtein
and OptimalStringAlignment
distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
fuzzywuzzy
The package also defines Distance "modifiers" that are inspired by the Python package  fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).
 Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
 TokenSort adjusts for differences in word orders by returning the distance of the two strings, after reordering words alphabetically.
 TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
 TokenMax normalizes the distance, and combine the
Partial
,TokenSort
andTokenSet
modifiers, with penalty terms depending on string.TokenMax(Levenshtein())
corresponds to the distance defined in fuzzywuzzy
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0
Notes
 All string distances are case sensitive.