StringDistances.jl

matthieugomez 78c3ec86f8 Update README.md		2021-07-04 10:25:51 -07:00
.github/workflows	Update ci.yml	2021-04-05 14:11:21 -07:00
benchmark	rmv Travis	2021-04-05 13:55:09 -07:00
src	import from StatsAPI	2021-04-21 08:57:50 -07:00
test	clean tests	2020-11-14 12:37:04 -08:00
.gitignore	clean tests	2020-11-14 12:37:04 -08:00
LICENSE.md	first commit	2015-10-22 12:12:44 -04:00
Project.toml	import from StatsAPI	2021-04-21 08:57:50 -07:00
README.md	Update README.md	2021-07-04 10:25:51 -07:00

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

Distances are defined for AbstractStrings and for any iterator that defines length() (e.g. graphemes, AbstractVector).
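As a rough illustration of the iterator support, the sketch below applies Levenshtein to two vectors of words and to two grapheme iterators. It assumes the package is installed; the inputs are made up for this example.

```julia
using StringDistances
using Unicode: graphemes

# Vectors of words: edits are counted on whole elements,
# so dropping the word "quick" costs a single deletion.
words1 = split("the quick brown fox")
words2 = split("the brown fox")
d_words = Levenshtein()(words1, words2)

# Grapheme iterators compare user-perceived characters.
d_graphemes = Levenshtein()(graphemes("no\u00ebl"), graphemes("noel"))

println(d_words)       # 1
println(d_graphemes)   # 1
```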

The available distances are:

Edit Distances
- Hamming Distance Hamming()
- Jaro and Jaro-Winkler Distance Jaro() JaroWinkler()
- Levenshtein Distance Levenshtein()
- Damerau-Levenshtein Distance DamerauLevenshtein()
- RatcliffObershelp Distance RatcliffObershelp()

Q-gram distances compare the set of all substrings of length q in each string.
- QGram Distance QGram(q::Int)
- Cosine Distance Cosine(q::Int)
- Jaccard Distance Jaccard(q::Int)
- Overlap Distance Overlap(q::Int)
- Sorensen-Dice Distance SorensenDice(q::Int)
- MorisitaOverlap Distance MorisitaOverlap(q::Int)
- Normalized Multiset Distance NMD(q::Int)
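To make the q-gram idea concrete, here is a minimal hand-rolled sketch in plain Julia. The helpers `qgram_set` and `jaccard_sketch` are hypothetical names, not part of StringDistances, and the sketch assumes the standard set-based definition of the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|.

```julia
# Hypothetical helper (not part of StringDistances): the set of all
# substrings of length q, built character by character.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)
    return Set(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
# Assumes each string has at least q characters.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

println(sort(collect(qgram_set("martha", 3))))   # ["art", "mar", "rth", "tha"]
println(jaccard_sketch("martha", "marhta", 3))   # only "mar" is shared: 1 - 1/7
```

The two strings share one of seven distinct 3-grams, which is why a single transposition already produces a large q-gram distance.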

Basic Use

evaluate

You can compute any distance between two strings with either of the following forms:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
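A short runnable version of the calls above, assuming the package is installed. It also adds DamerauLevenshtein for contrast: Levenshtein counts the swapped "th"/"ht" as two substitutions, while DamerauLevenshtein counts it as one transposition.

```julia
using StringDistances

d1 = evaluate(Levenshtein(), "martha", "marhta")   # function form
d2 = Levenshtein()("martha", "marhta")             # callable form, same result
d3 = DamerauLevenshtein()("martha", "marhta")      # adjacent swap counts once

println((d1, d2, d3))   # (2, 2, 1)
```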

pairwise

pairwise returns the matrix of distances between two AbstractVectors of AbstractStrings:

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

The function pairwise is particularly optimized for Q-gram distances (each element is processed only once).
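The sketch below shows how the result of pairwise might be read, assuming the package is installed. The variable names `left`, `right` and `D` are chosen for this example only.

```julia
using StringDistances

left  = ["martha", "kitten"]
right = ["marhta", "sitting"]
D = pairwise(Jaccard(3), left, right)

# D[i, j] holds the distance between left[i] and right[j].
println(size(D))                                   # (2, 2)
println(D[2, 2] ≈ Jaccard(3)(left[2], right[2]))   # entry matches a direct call
```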

similarity scores

The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.
```
Levenshtein()("martha", "martha")
#> 0.0
compare("martha", "martha", Levenshtein())
#> 1.0
```
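For a pair of strings that are close but not equal, the score falls between the two extremes. The sketch below compares the result of compare with a hand computation. It assumes that, for Levenshtein, the distance is normalized by the length of the longer string; this README does not spell out the normalization, so treat that as an assumption.

```julia
using StringDistances

s1, s2 = "martha", "marhta"
d = Levenshtein()(s1, s2)                 # raw edit distance
score = compare(s1, s2, Levenshtein())    # similarity score in [0, 1]

# Assumption: normalization divides by the length of the longer string.
by_hand = 1 - d / max(length(s1), length(s2))
println(score ≈ by_hand)
```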
findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:
```
findnearest(s, itr, dist::StringDistance)
```
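A small usage sketch, assuming the package is installed. The candidate list is illustrative.

```julia
using StringDistances

candidates = ["NewYork", "Newark", "San Francisco"]

# Returns the closest candidate together with its position in the list.
value, index = findnearest("New York", candidates, Levenshtein())
println((value, index))   # ("NewYork", 1)
```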
findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (defaults to 0.8). Its syntax is:
```
findall(s, itr, dist::StringDistance; min_score = 0.8)
```
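The sketch below contrasts the default threshold with a looser one, assuming the package is installed. The candidate list and the value 0.5 are illustrative; the default threshold should keep only very close variants, while a lower min_score admits more distant ones.

```julia
using StringDistances

candidates = ["NewYork", "Newark", "San Francisco"]

# Default threshold (min_score = 0.8).
close_matches = findall("New York", candidates, Levenshtein())

# A lower threshold lets more distant candidates through.
loose_matches = findall("New York", candidates, Levenshtein(); min_score = 0.5)

println(close_matches)
println(loose_matches)
```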

The functions findnearest and findall are particularly optimized for the Levenshtein and DamerauLevenshtein distances (these distances stop early if the distance is higher than a certain threshold).
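The early-stopping idea can be sketched in plain Julia as follows. This is a hypothetical illustration, not the package's implementation: `bounded_levenshtein` is a made-up name, and it returns `max_dist + 1` to mean "more than max_dist".

```julia
# Hypothetical sketch (not the package's code): Levenshtein distance
# that gives up once the result is known to exceed max_dist.
function bounded_levenshtein(s1::AbstractString, s2::AbstractString, max_dist::Int)
    a, b = collect(s1), collect(s2)
    # The length difference is a lower bound on the distance.
    abs(length(a) - length(b)) > max_dist && return max_dist + 1
    prev = collect(0:length(b))   # distances from the empty prefix of a
    for i in eachindex(a)
        curr = similar(prev)
        curr[1] = i
        for j in eachindex(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        # Row minima never decrease, so the final distance is at least
        # minimum(curr): stop as soon as that exceeds the threshold.
        minimum(curr) > max_dist && return max_dist + 1
        prev = curr
    end
    return min(prev[end], max_dist + 1)
end

println(bounded_levenshtein("martha", "marhta", 5))   # 2
println(bounded_levenshtein("kitten", "abcdef", 2))   # 3, i.e. "more than 2"
```

In the second call the loop exits after three rows instead of six. A search over many candidates can save work the same way by passing the best distance found so far as the threshold.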

References

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy