[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)

This Julia package computes various distances between `AbstractString`s.

## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and can be installed at the REPL with `] add StringDistances`.

## Syntax
The function `compare` returns a similarity score between two strings. The score always lies between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are maximally similar.
```julia
using StringDistances
compare("martha", "martha", Hamming())
#> 1.0
compare("martha", "marhta", Hamming())
#> 0.6666666666666667
```
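For the Hamming distance, the scores above are consistent with rescaling the number of mismatched positions by the string length. The following is a minimal sketch of that reading, not the package's implementation: `hamming_distance` and `hamming_similarity` are hypothetical names, and the sketch assumes both strings have the same length.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.jl.

# Count the positions at which two equal-length strings differ.
function hamming_distance(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) ||
        throw(ArgumentError("this sketch assumes strings of equal length"))
    return count(c1 != c2 for (c1, c2) in zip(s1, s2))
end

# Rescale the distance to a similarity score between 0 and 1.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    n = length(s1)
    n == 0 && return 1.0               # two empty strings: nothing differs
    return 1.0 - hamming_distance(s1, s2) / n
end

hamming_similarity("martha", "martha")  # 1.0
hamming_similarity("martha", "marhta")  # ≈ 0.6667 (2 mismatches out of 6)
```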
## Distances

#### Edit Distances
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
- [Ratcliff-Obershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`

#### Q-Gram Distances
Q-gram distances compare the set of all substrings of length `q` in each string.
- QGram Distance `QGram(q)`
- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q)`
- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q)`
- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q)`
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q)`
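
To make the idea of comparing q-grams concrete, here is a small self-contained sketch that builds the set of q-grams of a string and computes a Jaccard distance on those sets. The names `qgram_set` and `jaccard_qgram_distance` are hypothetical, and the sketch follows the set-based description above; the package may count repeated q-grams differently.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.jl.

# Collect the set of all substrings of length q in s.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)                 # index by character, not by byte
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_qgram_distance(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

qgram_set("william", 2)                           # wi, il, ll, li, ia, am
jaccard_qgram_distance("william", "williams", 2)  # ≈ 0.1429 (6 shared of 7)
```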
## Distance Modifiers
The package includes distance "modifiers" that can be applied to any distance.

- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
```julia
compare("martha", "marhta", Jaro())
#> 0.9444444444444445
compare("martha", "marhta", Winkler(Jaro()))
#> 0.9611111111111111
compare("william", "williams", QGram(2))
#> 0.9230769230769231
compare("william", "williams", Winkler(QGram(2)))
#> 0.9538461538461539
```
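
The scores above are consistent with the conventional Winkler adjustment, which moves a score toward 1 in proportion to the length of the common prefix. The sketch below assumes the usual parameters (scaling factor 0.1, prefix capped at 4 characters) and omits any minimum-score threshold; `winkler_boost` is a hypothetical name, not a function of the package.

```julia
# Hypothetical helper for illustration; not part of StringDistances.jl.
# Assumed parameters: scaling factor p = 0.1 and a prefix cap of 4 characters.
function winkler_boost(score::Real, s1::AbstractString, s2::AbstractString;
                       p::Real = 0.1, max_prefix::Integer = 4)
    # Length of the common prefix, capped at max_prefix.
    l = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && l < max_prefix) || break
        l += 1
    end
    # Move the score toward 1 in proportion to the shared prefix.
    return score + l * p * (1 - score)
end

winkler_boost(0.9444444444444445, "martha", "marhta")     # ≈ 0.9611 (prefix "mar")
winkler_boost(0.9230769230769231, "william", "williams")  # ≈ 0.9538 (prefix capped at 4)
```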
- Modifiers from the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/), which can be applied to any distance.
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximum similarity score between the shorter string and substrings of the longer string.
```julia
compare("New York Yankees", "Yankees", Levenshtein())
#> 0.4375
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
#> 1.0
```
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
```julia
compare("mariners vs angels", "angels vs mariners", RatcliffObershelp())
#> 0.44444
compare("mariners vs angels", "angels vs mariners", TokenSort(RatcliffObershelp()))
#> 1.0
```
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and in the number of words by comparing the intersection of the two strings with each string.
```julia
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
#> 0.559904
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
#> 0.944444
```
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.
```julia
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
#> 0.855
```
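
The word reordering behind `TokenSort` can be sketched in one line: split each string on whitespace, sort the words, and rejoin them before applying the base distance. The helper `sort_tokens` is a hypothetical name used only for illustration, and the package's own preprocessing may differ in details such as how whitespace is handled.

```julia
# Hypothetical helper for illustration; not part of StringDistances.jl.
# Split on whitespace, sort the words alphabetically, rejoin with single spaces.
sort_tokens(s::AbstractString) = join(sort(split(s)), " ")

sort_tokens("mariners vs angels")  # "angels mariners vs"
sort_tokens("angels vs mariners")  # "angels mariners vs"
```

Because both strings reduce to the same sorted form, any base distance then treats them as identical, which is why the `TokenSort` example above scores 1.0.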
## Find
`find_best` returns the element of an iterator with the highest similarity score.
```julia
find_best("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein())
#> "NewYork"
```
`find_all` returns all the elements of an iterator with a similarity score higher than a minimum value (0.8 by default).
```julia
find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.8)
#> 1-element Array{String,1}:
#> "NewYork"
```
While these functions are defined for any distance, they are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
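
As an unoptimized illustration of what these two functions compute, the sketch below performs a plain linear scan with an arbitrary similarity function. The names `best_match`, `all_matches` and `toy_sim` are hypothetical and the toy similarity (Jaccard index on character sets) is only there to make the example runnable; none of this reflects the package's optimized implementation.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.jl.
# `sim` is any function returning a similarity score between 0 and 1.

# Return the candidate with the highest similarity to `query`
# (or `nothing` if there are no candidates).
function best_match(sim, query, candidates)
    best, best_score = nothing, -Inf
    for c in candidates
        score = sim(query, c)
        if score > best_score
            best, best_score = c, score
        end
    end
    return best
end

# Return every candidate whose similarity to `query` exceeds `min_score`.
all_matches(sim, query, candidates; min_score = 0.8) =
    [c for c in candidates if sim(query, c) > min_score]

# Toy similarity: Jaccard index of the two character sets.
toy_sim(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

names = ["NewYork", "Newark", "San Francisco"]
best_match(toy_sim, "New York", names)   # "NewYork"
all_matches(toy_sim, "New York", names)  # ["NewYork"]
```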
## Compare vs Evaluate
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning completely similar. Some distances are bounded between 0 and 1, while others are unbounded.
```julia
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
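
To illustrate the difference between a raw distance and a normalized similarity, here is a textbook Levenshtein implementation together with one plausible normalization (dividing by the length of the longer string). Both `levenshtein_distance` and `levenshtein_similarity` are hypothetical names; the normalization is an assumption chosen because it reproduces the 0.4375 score shown earlier for "New York Yankees" and "Yankees".

```julia
# Hypothetical helpers for illustration; not part of StringDistances.jl.

# Levenshtein distance via the standard two-row dynamic program.
function levenshtein_distance(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))            # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: divide by the length of the longer string.
function levenshtein_similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / n
end

levenshtein_distance("New York", "New York")            # 0 (unbounded integer distance)
levenshtein_distance("New York Yankees", "Yankees")     # 9
levenshtein_similarity("New York Yankees", "Yankees")   # 0.4375 (1 - 9/16)
```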
## Which distance should I use?
As a rule of thumb,
- Standardize strings before comparing them (case, whitespace, accents, abbreviations, ...).
- The distance `TokenMax(Levenshtein())` is a good choice to link sequences of words (addresses, names) across datasets (see [`fuzzywuzzy`](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)).

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)