StringDistances.jl

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

The package defines two abstract types: StringSemiMetric <: SemiMetric, and StringMetric <: Metric. String distances inherit from one of these two types. They act over any pair of iterators that define length (this includes AbstractStrings, but also GraphemeIterators and AbstractVectors).

The available distances are:

Edit Distances
- Hamming Distance Hamming() <: StringSemiMetric
- Jaro and Jaro-Winkler Distance Jaro() JaroWinkler() <: StringSemiMetric
- Levenshtein Distance Levenshtein() <: StringMetric
- Optimal String Alignment Distance (a.k.a. restricted Damerau-Levenshtein) OptimalStringAlignement() <: StringSemiMetric
- Damerau-Levenshtein Distance DamerauLevenshtein() <: StringMetric
- RatcliffObershelp Distance RatcliffObershelp() <: StringSemiMetric
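
To make the notion of an edit distance concrete, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance, written in plain Julia. The function name `levenshtein_sketch` is hypothetical and this is not the package's implementation; it only illustrates what the distance counts (insertions, deletions and substitutions).

```julia
# Hypothetical helper for illustration only (not part of StringDistances.jl).
# Works on any iterable of comparable elements, e.g. strings or vectors.
function levenshtein_sketch(s1, s2)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))              # cost of building each prefix of b from nothing
    for i in 1:length(a)
        cur = similar(prev)
        cur[1] = i                           # cost of deleting the first i elements of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,    # deletion
                             cur[j] + 1,         # insertion
                             prev[j] + cost)     # substitution (or match)
        end
        prev = cur
    end
    return prev[end]
end

levenshtein_sketch("martha", "marhta")   # the swapped "th" counts as two substitutions
```

Because a plain Levenshtein distance has no transposition operation, swapping two adjacent characters costs two edits here; the Optimal String Alignment and Damerau-Levenshtein distances differ precisely in treating such a swap as a single edit.
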
Q-gram distances compare the set of all substrings of length q in each string.
- QGram Distance QGram(q::Int) <: StringSemiMetric
- Cosine Distance Cosine(q::Int) <: StringSemiMetric
- Jaccard Distance Jaccard(q::Int) <: StringSemiMetric
- Overlap Distance Overlap(q::Int) <: StringSemiMetric
- Sorensen-Dice Distance SorensenDice(q::Int) <: StringSemiMetric
- MorisitaOverlap Distance MorisitaOverlap(q::Int) <: StringSemiMetric
- Normalized Multiset Distance NMD(q::Int) <: StringSemiMetric
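
As an illustration of the q-gram idea, the following sketch builds the set of q-grams of each string and computes a Jaccard distance between the two sets. The names `qgrams_sketch` and `jaccard_sketch` are hypothetical, and the convention for two empty q-gram sets is an assumption of this sketch rather than documented package behavior.

```julia
# Hypothetical helpers for illustration only (not part of StringDistances.jl).

# Collect the set of all substrings of length q.
function qgrams_sketch(s, q::Int)
    chars = collect(s)
    grams = Set{String}()
    for i in 1:(length(chars) - q + 1)
        push!(grams, String(chars[i:i + q - 1]))
    end
    return grams
end

# Jaccard distance: 1 - |intersection| / |union| of the two q-gram sets.
function jaccard_sketch(s1, s2, q::Int)
    g1, g2 = qgrams_sketch(s1, q), qgrams_sketch(s2, q)
    u = length(union(g1, g2))
    u == 0 && return 0.0     # assumption: two inputs without q-grams are at distance 0
    return 1 - length(intersect(g1, g2)) / u
end

# "martha" and "marhta" share the 2-grams "ma" and "ar" out of 8 distinct ones.
jaccard_sketch("martha", "marhta", 2)
```
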

Basic Use

distance

You can always compute a certain distance between two strings using the following syntax:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")

In contrast, the function compare returns a similarity score, defined as 1 minus the normalized distance between two strings. It always returns a Float64. A value of 0.0 means completely different and a value of 1.0 means the strings match completely.

compare("martha", "martha", Levenshtein())
#> 1.0
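
The relationship between a distance and a similarity score can be sketched as follows. This is only an illustration under stated assumptions: `mismatches` is a toy Hamming-style distance, `similarity_sketch` is a hypothetical helper, and normalizing by the length of the longer input is an assumption that suits distances bounded by that length. The package may normalize each distance differently.

```julia
# Toy distance for this sketch: number of mismatched positions
# (a Hamming-style count; assumes both inputs have the same length).
mismatches(s1, s2) = count(p -> p[1] != p[2], zip(s1, s2))

# Hypothetical helper: similarity = 1 - distance / (largest possible distance).
# Assumption: the distance is bounded above by the length of the longer input.
function similarity_sketch(dist, s1, s2)
    maxlen = max(length(s1), length(s2))
    maxlen == 0 && return 1.0          # assumption: two empty inputs count as a full match
    return 1.0 - dist(s1, s2) / maxlen
end

similarity_sketch(mismatches, "martha", "martha")   # no mismatch, so the score is 1.0
similarity_sketch(mismatches, "martha", "marhta")   # 2 mismatches out of 6 positions
```
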

pairwise

pairwise returns the matrix of distances between two AbstractVectors of AbstractStrings (or iterators):

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

The function pairwise is particularly optimized for Q-gram distances (each element is processed only once).
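
The "processed only once" idea can be sketched in plain Julia: preprocess every element a single time, then fill the matrix from the preprocessed values. The names `set_jaccard` and `pairwise_sketch` are hypothetical, and the sketch simplifies q-grams to character sets (that is, q = 1); it does not reproduce the package's internals. It assumes, as is conventional, that rows index the first collection and columns the second.

```julia
# Toy Jaccard distance between two precomputed (non-empty) sets.
set_jaccard(a, b) = 1 - length(intersect(a, b)) / length(union(a, b))

# Hypothetical helper: `prep` runs once per element instead of once per pair.
function pairwise_sketch(dist, prep, xs, ys)
    pxs = map(prep, xs)
    pys = map(prep, ys)
    # Entry [i, j] holds the distance between xs[i] and ys[j].
    return [dist(px, py) for px in pxs, py in pys]
end

# `Set` turns each string into its set of characters (1-grams).
R = pairwise_sketch(set_jaccard, Set, ["martha", "kitten"], ["marhta", "sitting"])
size(R)   # (2, 2)
```

With n and m elements, the preprocessing step runs n + m times rather than n × m times, which is where the saving comes from.
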

fuzzywuzzy

The package also defines distance "modifiers" modeled on those in the Python package fuzzywuzzy. These modifiers are particularly helpful for matching strings composed of multiple words (e.g. addresses, company names).

- Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by returning the distance between the two strings after re-ordering their words alphabetically.
- TokenSet adjusts for differences in word order and word count by returning the distance between the intersection of the two strings and each string.
- TokenMax normalizes the distance and combines the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
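
The first two modifiers can be sketched in a few lines of plain Julia. The helpers `token_sort_sketch`, `partial_sketch` and `mismatches` are hypothetical; the sketch assumes that words are separated by whitespace and that Partial compares the shorter string against every window of the same length in the longer one. It is not the package's implementation.

```julia
# Hypothetical helper: put words in alphabetical order before comparing.
token_sort_sketch(s) = join(sort(split(s)), " ")

# Hypothetical helper: smallest distance between the shorter input and any
# window of the same length in the longer input.
function partial_sketch(dist, s1, s2)
    a, b = collect(s1), collect(s2)
    short, long = length(a) <= length(b) ? (a, b) : (b, a)
    n = length(short)
    return minimum(dist(short, long[i:i + n - 1]) for i in 1:(length(long) - n + 1))
end

# Toy distance for equal-length inputs: number of mismatched positions.
mismatches(x, y) = count(p -> p[1] != p[2], zip(x, y))

token_sort_sketch("new york mets") == token_sort_sketch("mets york new")   # true
partial_sketch(mismatches, "york", "new york mets")                          # 0: "york" occurs verbatim
```
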

find

The package also adds convenience functions to find the elements in a list that are closest to a given string.

findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
```
findnearest(s, itr, dist::StringDistance)
```
findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (0.8 by default). Its syntax is:
```
findall(s, itr, dist::StringDistance; min_score = 0.8)
```

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignement distances (these distances stop early if the distance is higher than a certain threshold).
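
The behavior of these two lookups can be sketched in plain Julia as "score every candidate, then take the best one or all those above a threshold". Everything here is hypothetical: `charset_similarity` is a toy score, `findnearest_sketch` and `findall_sketch` are illustrative names with their own argument order, and treating the threshold as inclusive is an assumption of the sketch. The early-stopping optimization mentioned above is not reproduced.

```julia
# Toy similarity for this sketch: Jaccard index of the two character sets
# (assumes at least one of the inputs is non-empty).
function charset_similarity(a, b)
    sa, sb = Set(a), Set(b)
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Hypothetical helper: value and index of the best-scoring candidate.
# Assumes `itr` is non-empty.
function findnearest_sketch(sim, s, itr)
    items = collect(itr)
    scores = [sim(s, x) for x in items]
    i = argmax(scores)
    return items[i], i
end

# Hypothetical helper: indices of all candidates scoring at least `min_score`
# (the inclusive comparison is an assumption of this sketch).
function findall_sketch(sim, s, itr; min_score = 0.8)
    scores = [sim(s, x) for x in itr]
    return findall(>=(min_score), scores)
end

cities = ["NewYork", "Newark", "San Francisco"]
findnearest_sketch(charset_similarity, "New York", cities)   # ("NewYork", 1)
findall_sketch(charset_similarity, "New York", cities)       # [1]
```
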

Notes

All string lookups are case sensitive.