StringDistances.jl

Go to file

Matthieu Gomez d47e43a757 add tests and tag new version		2024-04-07 18:44:01 -04:00
.github/workflows	Update CompatHelper.yml	2023-06-01 10:04:01 -04:00
benchmark	Rename `OptimalStringAlignement` to `OptimalStringAlignment` (#57 )	2021-12-02 13:23:41 -05:00
src	fix thread unsafety (#63 )	2024-04-07 15:38:56 -04:00
test	add tests and tag new version	2024-04-07 18:44:01 -04:00
.gitignore	clean tests	2020-11-14 12:37:04 -08:00
LICENSE.md	first commit	2015-10-22 12:12:44 -04:00
Project.toml	add tests and tag new version	2024-04-07 18:44:01 -04:00
README.md	Update README.md	2022-10-06 09:28:07 -04:00

README.md

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)

The available distances are:

Edit Distances
- Hamming Distance Hamming() <: SemiMetric
- Jaro and Jaro-Winkler Distance Jaro() JaroWinkler() <: SemiMetric
- Levenshtein Distance Levenshtein() <: Metric
- Optimal String Alignment Distance (a.k.a. restricted Damerau-Levenshtein) OptimalStringAlignment() <: SemiMetric
- Damerau-Levenshtein Distance DamerauLevenshtein() <: Metric
- RatcliffObershelp Distance RatcliffObershelp() <: SemiMetric
Q-gram distances (which compare the set of all substrings of length q in each string)
- QGram Distance QGram(q::Int) <: SemiMetric
- Cosine Distance Cosine(q::Int) <: SemiMetric
- Jaccard Distance Jaccard(q::Int) <: SemiMetric
- Overlap Distance Overlap(q::Int) <: SemiMetric
- Sorensen-Dice Distance SorensenDice(q::Int) <: SemiMetric
- MorisitaOverlap Distance MorisitaOverlap(q::Int) <: SemiMetric
- Normalized Multiset Distance NMD(q::Int) <: SemiMetric

Syntax

Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.

Computing the distance between two strings (or iterators)

You can always compute a certain distance between two strings using the following syntax

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as

r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)

The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.

Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0

Computing the distance between two AbstractVectors of strings (or iterators)

Consider X and Y two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:

pairwise(dist, X, Y)

For instance,

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

pairwise is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)

Find closest string

The package also adds convenience functions to find elements in a iterator of strings closest to a given string

findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
```
findnearest(s, itr, dist)
```
findall returns the indices of all elements in itr with a similarity score with s higher than a minimum score. Its syntax is:
```
findall(s, itr, dist; min_score = 0.8)
```

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.

fuzzywuzzy

The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).

Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
TokenMax normalizes the distance, and combine the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy

Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0

Notes

All string distances are case sensitive.