StringDistances.jl

matthieugomez 82d5f3bc91 remove Hamming, create StringDistance		2019-12-12 15:11:32 -05:00

This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar. Its syntax is:

```
compare(::AbstractString, ::AbstractString, ::StringDistance)
```

Edit Distances
- Jaro Distance Jaro()
- Levenshtein Distance Levenshtein()
- Damerau-Levenshtein Distance DamerauLevenshtein()
- Ratcliff-Obershelp Distance RatcliffObershelp()

Q-gram distances compare the set of all substrings of length q in each string.
- QGram Distance QGram(q)
- Cosine Distance Cosine(q)
- Jaccard Distance Jaccard(q)
- Overlap Distance Overlap(q)
- Sorensen-Dice Distance SorensenDice(q)
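
To make the q-gram idea concrete, here is a minimal sketch of a Jaccard distance computed on sets of q-grams. The helper names `qgram_set` and `jaccard_sketch` are hypothetical and written only for illustration; they are not the package's implementation, which may count or weight q-grams differently.

```julia
# Illustrative sketch only: hypothetical helpers, not part of StringDistances.jl.

# Collect the set of all length-q substrings of s, indexing by character.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)
    n = length(chars)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(n - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    # Assumption: two strings without any q-gram are treated as identical.
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "william" has 6 distinct bigrams, "williams" has the same 6 plus "ms".
jaccard_sketch("william", "williams", 2)   # 1 - 6/7
```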

The package includes distance "modifiers" that can be applied to any distance.
- Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by reordering words alphabetically.
- TokenSet adjusts for differences in word order and word count by comparing the set of words common to both strings with each string.
- TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
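
As a rough illustration of two of these modifiers, the sketch below shows a Winkler-style prefix boost and a TokenSort-style preprocessing step. The function names, the scaling factor `p = 0.1` and the prefix cap of 4 characters are assumptions taken from the usual textbook description of the Winkler adjustment, not values read from this package.

```julia
# Illustrative sketch only: hypothetical helpers, not part of StringDistances.jl.

# Length of the common prefix of two strings, capped at maxlen characters.
function common_prefix_length(s1::AbstractString, s2::AbstractString, maxlen::Integer)
    n = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && n < maxlen) || break
        n += 1
    end
    return n
end

# Winkler-style boost: move a similarity score toward 1 in proportion to the
# length of the shared prefix. p and maxlen are assumed defaults.
winkler_boost(score, s1, s2; p = 0.1, maxlen = 4) =
    score + common_prefix_length(s1, s2, maxlen) * p * (1 - score)

# TokenSort-style preprocessing: split on whitespace, sort the words, rejoin.
# A base distance would then be applied to the two reordered strings.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

winkler_boost(0.5, "martha", "marhta")          # shared prefix "mar" has length 3
token_sort("new york mets vs atlanta braves")   # "atlanta braves mets new vs york"
```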

Some examples:

```
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
```

A good distance for linking addresses and similar records (where word order does not matter) is TokenMax(Levenshtein()).

Find

findmax returns the value and index of the element in iter with the highest similarity score with x. Its syntax is:
```
findmax(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```
findall returns the indices of all elements in iter whose similarity score with x is higher than a minimum value (0.8 by default). Its syntax is:
```
findall(x::AbstractString, iter::AbstractVector, dist::StringDistance)
```

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
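
The behavior described above can be sketched in plain Julia. The names `best_match`, `matches_above` and `toy_sim` are hypothetical, and `toy_sim` is a deliberately crude similarity used only so that the example runs on its own. This sketch does none of the optimizations mentioned above.

```julia
# Illustrative sketch only: hypothetical helpers, not part of StringDistances.jl.

# Toy similarity for the demo: share of distinct characters the strings have in common.
toy_sim(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

# Return the candidate most similar to x, its index, and its score.
function best_match(sim, x::AbstractString, candidates::AbstractVector)
    scores = [sim(x, c) for c in candidates]
    score, i = findmax(scores)   # Base.findmax returns (maximum, index)
    return candidates[i], i, score
end

# Return the indices of all candidates scoring above min_score.
matches_above(sim, x::AbstractString, candidates::AbstractVector; min_score = 0.8) =
    findall(c -> sim(x, c) > min_score, candidates)

candidates = ["NewYork", "Newark", "San Francisco"]
best_match(toy_sim, "New York", candidates)      # ("NewYork", 1, 0.875)
matches_above(toy_sim, "New York", candidates)   # [1]
```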

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1, while others are unbounded.

```
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
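
One way to see how an unbounded distance can be turned into a score between 0 and 1 is the sketch below. It implements the classic dynamic-programming Levenshtein distance and then rescales it by the length of the longer string. The rescaling rule and the names `levenshtein_sketch` and `similarity_sketch` are assumptions made for illustration; the package's own normalization may differ.

```julia
# Illustrative sketch only: hypothetical helpers, not part of StringDistances.jl.

# Classic dynamic-programming Levenshtein distance, keeping two rows of the table.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))      # distances from the empty prefix of a
    curr = similar(prev)
    for i in 1:length(a)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev, curr = curr, prev
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    maxlen = max(length(s1), length(s2))
    maxlen == 0 && return 1.0
    return 1 - levenshtein_sketch(s1, s2) / maxlen
end

levenshtein_sketch("kitten", "sitting")   # 3
similarity_sketch("kitten", "sitting")    # 1 - 3/7
```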

References

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy