StringDistances.jl

matthieugomez 8be5a00e3d remove trie		2019-12-12 19:08:49 -05:00

This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The score is always between 0 and 1: 0 means the strings are completely different and 1 means they are identical. Its syntax is:

compare(::AbstractString, ::AbstractString, ::StringDistance)

Edit Distances
- Jaro Distance Jaro()
- Levenshtein Distance Levenshtein()
- Damerau-Levenshtein Distance DamerauLevenshtein()
- RatcliffObershelp Distance RatcliffObershelp()
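
As an illustration of what an edit distance measures, here is a minimal sketch of the Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions) using the classic two-row dynamic program. The name levenshtein_sketch is hypothetical and this is not the package's implementation, which may differ.

```julia
# Levenshtein distance via a two-row dynamic program.
# `levenshtein_sketch` is a hypothetical name, not part of StringDistances.jl.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)      # index by character, not by byte
    prev = collect(0:length(b))          # distances from "" to each prefix of b
    curr = similar(prev)
    for i in 1:length(a)
        curr[1] = i                      # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev, curr = curr, prev          # reuse the two rows
    end
    return prev[end]
end

levenshtein_sketch("kitten", "sitting")  # 3
levenshtein_sketch("martha", "marhta")   # 2
```
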
Q-gram distances compare the set of all substrings of length q in each string.
- QGram Distance QGram(q)
- Cosine Distance Cosine(q)
- Jaccard Distance Jaccard(q)
- Overlap Distance Overlap(q)
- Sorensen-Dice Distance SorensenDice(q)
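
To make the q-gram idea concrete, the sketch below collects the set of q-grams of each string and computes a Jaccard distance, taken here as one minus the ratio of shared q-grams to all distinct q-grams. The helper names qgram_set and jaccard_distance_sketch are hypothetical, and the package's own implementation may handle edge cases differently.

```julia
# Set of all substrings of q consecutive characters.
# `qgram_set` is a hypothetical helper, not part of StringDistances.jl.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)
    return Set{String}(String(chars[i:i + q - 1]) for i in 1:length(chars) - q + 1)
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_distance_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0   # assumption: no q-grams at all counts as identical
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "william" has 6 distinct bigrams and "williams" adds "ms", so the distance is 1 - 6/7.
jaccard_distance_sketch("william", "williams", 2)  # ≈ 0.143
```
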
The package includes distance "modifiers" that can be applied to any distance.
- Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by reordering words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
- TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
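
Two of these modifiers are simple enough to sketch. The first function applies the classic Jaro-Winkler boost, score + l * p * (1 - score), where l is the length of the common prefix (capped at 4) and p = 0.1. The second sorts whitespace-separated words before comparison. All names and default values below are assumptions for illustration; the package's Winkler and TokenSort may use different parameters or thresholds.

```julia
# Number of leading characters shared by s1 and s2, capped at `limit`.
function common_prefix_length(s1::AbstractString, s2::AbstractString, limit::Integer)
    n = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && n < limit) || break
        n += 1
    end
    return n
end

# Classic Winkler boost applied to any similarity score in [0, 1].
# `p` and `max_prefix` are the textbook defaults, assumed here.
winkler_boost(score, s1, s2; p = 0.1, max_prefix = 4) =
    score + common_prefix_length(s1, s2, max_prefix) * p * (1 - score)

# Reorder words alphabetically so that word order no longer matters.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# The textbook Jaro similarity of "martha" and "marhta" is 17/18 ≈ 0.944;
# the shared prefix "mar" lifts it to about 0.961.
winkler_boost(17 / 18, "martha", "marhta")

token_sort("new york mets vs atlanta braves")  # "atlanta braves mets new vs york"
```

After token_sort, any base distance can be applied to the two reordered strings as usual.
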

Some examples:

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))

When word order does not matter, TokenMax(Levenshtein()) is a good choice.

Find

findmax returns the value and index of the element in itr with the highest similarity score with x. Its syntax is:
```
findmax(x::AbstractString, itr, dist::StringDistance; min_score = 0.0)
```
findall returns the indices of all elements in itr whose similarity score with x is higher than a minimum value (0.8 by default). Its syntax is:
```
findall(x::AbstractString, itr, dist::StringDistance; min_score = 0.8)
```

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
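
The behavior described above can be sketched without the package as a plain linear scan. Everything in this block is hypothetical: toy_similarity is a stand-in for compare, best_match and all_matches mirror findmax and findall only in spirit, the threshold is treated as inclusive, and returning (nothing, nothing) when nothing qualifies is an assumption. None of the package's optimizations are reproduced.

```julia
# Toy similarity in [0, 1]: Jaccard index on the sets of characters.
# A stand-in for `compare`, chosen only to keep the sketch self-contained.
function toy_similarity(a::AbstractString, b::AbstractString)
    sa, sb = Set(a), Set(b)
    isempty(sa) && isempty(sb) && return 1.0
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Value and position of the best-scoring element of `itr`,
# or (nothing, nothing) if no element reaches `min_score`.
function best_match(sim, x, itr; min_score = 0.0)
    best_score = -Inf
    best = (nothing, nothing)
    for (i, y) in enumerate(itr)
        score = sim(x, y)
        if score >= min_score && score > best_score
            best_score, best = score, (y, i)
        end
    end
    return best
end

# Positions of all elements whose score reaches `min_score`.
all_matches(sim, x, itr; min_score = 0.8) =
    [i for (i, y) in enumerate(itr) if sim(x, y) >= min_score]

candidates = ["New York", "Newark", "Boston"]
best_match(toy_similarity, "NewYork", candidates)                    # ("New York", 1)
all_matches(toy_similarity, "NewYork", candidates; min_score = 0.6)  # [1, 2]
```
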

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 meaning identical. Some distances are between 0 and 1, while others are unbounded.

compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
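
One way to relate the two functions for an unbounded edit distance is to divide the distance by the length of the longer string and subtract the result from 1. The sketch below assumes that normalization; the helper name similarity_from_distance is hypothetical, and the exact formula the package uses for each distance is not stated here.

```julia
# Hypothetical helper: map an edit distance `d` to a score in [0, 1]
# by dividing by the length of the longer string (an assumed normalization).
function similarity_from_distance(d::Integer, s1::AbstractString, s2::AbstractString)
    longest = max(length(s1), length(s2))
    longest == 0 && return 1.0      # two empty strings are identical
    return 1 - d / longest
end

# Identical strings: distance 0, similarity 1.0.
similarity_from_distance(0, "New York", "New York")  # 1.0

# "kitten" and "sitting" are 3 edits apart and the longer string has 7 characters.
similarity_from_distance(3, "kitten", "sitting")     # 1 - 3/7 ≈ 0.571
```
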

References

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy