[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
This Julia package computes various distances between `AbstractString`s.
## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.
## Compare
The function `compare` returns a similarity score between two strings. The score always lies between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar. Its syntax is:
```julia
compare(::AbstractString, ::AbstractString, ::StringDistance)
```
The distance can be any of the following:

- Edit Distances
    - [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
    - [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
    - [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
    - [Ratcliff-Obershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
    - QGram Distance `QGram(q)`
    - [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q)`
    - [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q)`
    - [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q)`
    - [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q)`
- The package includes distance "modifiers" that can be applied to any distance.
    - [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
    - [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.
    - [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
    - [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
    - [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance and the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.
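
To make the q-gram idea concrete, here is a minimal sketch in plain Julia of a Jaccard distance computed on sets of q-grams. The helper names `qgram_set` and `jaccard_sketch` are hypothetical and not part of the package, whose own implementation may differ (for instance in how it treats strings shorter than `q`).

```julia
# Hypothetical helper: the set of all substrings of length q in s.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)                      # index by character, not by byte
    n = length(chars)
    # The range is empty when the string is shorter than q.
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(n - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    # Convention chosen for this sketch: two empty sets are at distance 0.
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "william" has 6 distinct 2-grams; "williams" has the same 6 plus "ms",
# so the distance is 1 - 6/7.
jaccard_sketch("william", "williams", 2)
```

Working on sets ignores how often a q-gram occurs; a distance that counts multiplicities would need a dictionary of counts instead.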
Some examples:
```julia
using StringDistances

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
```
When word order does not matter, a good distance is `TokenMax(Levenshtein())`.
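
To illustrate why token-based modifiers help when word order varies, the reordering step described for `TokenSort` can be sketched in plain Julia. The helper `token_sort_sketch` is hypothetical and assumes words are separated by whitespace; the package's own tokenization may differ.

```julia
# Hypothetical helper: split on whitespace, sort the words alphabetically,
# and join them back with single spaces.
token_sort_sketch(s::AbstractString) = join(sort(split(s)), " ")

a = token_sort_sketch("seattle mariners vs angels")
b = token_sort_sketch("angels vs seattle mariners")

# Both inputs contain the same words, so they normalize to the same string
# and any base distance would then treat them as identical.
a == b
```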
## Find
- `findmax` returns the value and index of the element in `itr` with the highest similarity score with `x`. Its syntax is:
```julia
findmax(x::AbstractString, itr, dist::StringDistance; min_score = 0.0)
```
- `findall` returns the indices of all elements in `itr` with a similarity score with `x` higher than a minimum value (defaults to 0.8). Its syntax is:
```julia
findall(x::AbstractString, itr, dist::StringDistance; min_score = 0.8)
```
The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
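
The behavior described for the two functions can be sketched in plain Julia as a linear scan. Everything here is hypothetical and does not reproduce the package's optimized code: `toy_similarity` is a stand-in for `compare`, and returning `(nothing, nothing)` when no element reaches `min_score` is a choice made for this sketch.

```julia
# Stand-in similarity in [0, 1]: share of positions holding the same character.
function toy_similarity(s1::AbstractString, s2::AbstractString)
    c1, c2 = collect(s1), collect(s2)
    longest = max(length(c1), length(c2))
    longest == 0 && return 1.0
    matches = count(i -> c1[i] == c2[i], 1:min(length(c1), length(c2)))
    return matches / longest
end

# Value and index of the element most similar to x, among those scoring
# at least min_score.
function findmax_sketch(x, itr, similarity; min_score = 0.0)
    best_value, best_index, best_score = nothing, nothing, -Inf
    for (i, candidate) in enumerate(itr)
        score = similarity(x, candidate)
        if score >= min_score && score > best_score
            best_value, best_index, best_score = candidate, i, score
        end
    end
    return best_value, best_index
end

# Indices of all elements whose similarity with x is higher than min_score.
findall_sketch(x, itr, similarity; min_score = 0.8) =
    [i for (i, candidate) in enumerate(itr) if similarity(x, candidate) > min_score]

cities = ["New York", "Newark", "York"]
findmax_sketch("New Yorks", cities, toy_similarity)
findall_sketch("New Yorks", cities, toy_similarity)
```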
## Evaluate
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the literal distance between two strings, so a value of 0 means completely similar. Some distances are bounded between 0 and 1, while others are unbounded.
```julia
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
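
The relation between an unbounded distance and a score in [0, 1] can be sketched in plain Julia. The function names are hypothetical. Dividing by the length of the longer string is an assumption of this sketch, chosen because the Levenshtein distance never exceeds that length; it is not necessarily how the package normalizes.

```julia
# Textbook dynamic-programming Levenshtein distance, keeping one row at a time.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))        # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                    # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    longest = max(length(s1), length(s2))
    longest == 0 && return 1.0
    return 1 - levenshtein_sketch(s1, s2) / longest
end

levenshtein_sketch("kitten", "sitting")   # 3 edits: k→s, e→i, insert g
similarity_sketch("kitten", "sitting")    # 1 - 3/7 under the assumed normalization
```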
## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf), Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)