pull/7/head
matthieugomez 2018-05-15 18:47:55 -04:00
parent 2ebe788f8c
commit 4a743452b3
2 changed files with 18 additions and 11 deletions


@@ -9,21 +9,21 @@ This Julia package computes various distances between strings.
## Distances
#### Edit Distances
-- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance)
-- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
-- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
+- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
+- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
+- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
#### Q-Grams Distances
Q-gram distances compare the set of all substrings of length `q` in each string.
-- QGram Distance
-- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity)
-- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index)
-- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient)
-- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
+- QGram Distance `Qgram(q)`
+- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q)`
+- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q)`
+- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q)`
+- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q)`
#### Others
-- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
-- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html)
+- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
+- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
## Syntax
The function `evaluate` returns the *literal distance* between two strings.
@@ -101,7 +101,7 @@ The package includes distance "modifiers", that can be applied to any distance.
As a rule of thumb,
- Standardize strings before comparing them (correct for uppercase, punctuation, whitespace, accents, abbreviations...)
- Don't use Edit Distances if word order does not matter.
-- The distance `Tokenmax(RatcliffObershelp())' is a good default choice.
+- The distance `Tokenmax(RatcliffObershelp())` is a good default choice.
## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo


@@ -12,6 +12,13 @@ using StringDistances, Base.Test
@test compare(Jaccard(2), "", "abc") ≈ 0.0 atol = 1e-4
@test compare(Jaccard(2), "martha", "martha") ≈ 1.0 atol = 1e-4
@test compare(Cosine(2), "martha", "martha") ≈ 1.0 atol = 1e-4
@test compare(Jaccard(2), "martha", "martha") ≈ 1.0 atol = 1e-4
@test compare(Overlap(2), "martha", "martha") ≈ 1.0 atol = 1e-4
@test compare(SorensenDice(2), "martha", "martha") ≈ 1.0 atol = 1e-4
# Winkler
@test compare(Winkler(Jaro(), 0.1, 0.0), "martha", "marhta") ≈ 0.9611 atol = 1e-4
@test compare(Winkler(Jaro(), 0.1, 0.0), "dwayne", "duane") ≈ 0.84 atol = 1e-4
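
The README text in this commit describes q-gram distances as comparing "the set of all substrings of length `q` in each string", and the tests exercise `Jaccard(2)`. As a rough illustration of that idea, here is a minimal standalone sketch using only Julia's standard library. The names `qgram_set` and `jaccard_sketch` are hypothetical and are not part of StringDistances; the package's own implementation may differ (for example, in how repeated q-grams are counted).

```julia
# Hypothetical helpers for illustration only; not the package's implementation.

# Collect the set of all substrings of length q (q-grams) of s.
# Strings shorter than q yield an empty set.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)
    n = length(chars)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(n - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
# Assumption: two strings with no q-grams at all are treated as distance 0.0.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

jaccard_sketch("martha", "martha", 2)  # identical strings: 0.0
jaccard_sketch("martha", "marhta", 2)  # 2 shared bigrams out of 8 distinct: 0.75
jaccard_sketch("", "abc", 2)           # nothing shared: 1.0
```

Under this sketch a distance of 0.0 corresponds to a similarity of 1.0 and a distance of 1.0 to a similarity of 0.0, which is the direction the `compare` tests above check for identical strings and for an empty string.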