StringDistances.jl/README.md

81 lines
4.4 KiB
Markdown
Raw Permalink Normal View History

2015-10-22 18:38:04 +02:00
[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
2015-10-23 03:03:57 +02:00
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
2015-10-22 18:38:04 +02:00
2019-12-11 22:12:24 +01:00
## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.
2020-03-03 12:48:00 +01:00
## Supported Distances
The available distances are:
2019-12-12 19:21:36 +01:00
- Edit Distances
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
2019-12-13 15:14:36 +01:00
- QGram Distance `Qgram(q::Int)`
- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q::Int)`
- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q::Int)`
- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q::Int)`
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int)`
2020-03-03 12:48:00 +01:00
- Distance "modifiers" that can be applied to any distance:
2020-03-03 12:49:53 +01:00
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
2020-02-12 15:58:03 +01:00
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum distance between the shorter string and substrings of the longer string.
2019-12-12 19:21:36 +01:00
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders by reording words alphabetically.
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
2020-03-03 12:49:53 +01:00
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
2020-03-03 12:48:00 +01:00
## Basic Use
2020-03-03 12:48:28 +01:00
### Evaluate
2020-03-03 12:52:39 +01:00
You can always compute a certain distance between two strings using the following syntax:
2020-03-03 12:48:00 +01:00
2019-08-20 19:21:31 +02:00
```julia
2020-03-03 12:48:00 +01:00
evaluate(dist, s1, s2)
dist(s1, s2)
```
2019-08-20 19:21:31 +02:00
2020-03-03 12:48:00 +01:00
For instance, with the `Levenshtein` distance,
2020-03-03 12:43:42 +01:00
2020-03-03 12:48:00 +01:00
```julia
evaluate(Levenshtein(), "martha", "marhta")
2020-03-03 12:43:42 +01:00
Levenshtein()("martha", "marhta")
2020-02-12 15:41:46 +01:00
```
2020-02-12 15:58:03 +01:00
2020-03-03 12:53:08 +01:00
You can also compute a distance between two iterators:
2020-03-03 12:52:25 +01:00
```julia
evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5])
2
```
2020-03-03 12:48:28 +01:00
### Compare
2020-03-03 12:51:05 +01:00
The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar.
2020-03-03 12:48:00 +01:00
2020-02-09 19:42:29 +01:00
```julia
2020-03-03 12:51:05 +01:00
evaluate(Levenshtein(), "martha", "martha")
2020-02-09 19:42:29 +01:00
#> 0
2020-03-03 12:51:05 +01:00
compare("martha", "martha", Levenshtein())
2020-02-09 19:42:29 +01:00
#> 1.0
```
2020-03-03 12:48:28 +01:00
### Find
2019-12-13 15:14:36 +01:00
- `findmax` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
2019-12-12 20:48:52 +01:00
```julia
2020-02-09 19:42:29 +01:00
findmax(s, itr, dist::StringDistance; min_score = 0.0)
2019-12-12 20:48:52 +01:00
```
2019-08-20 19:21:31 +02:00
2019-12-13 15:14:36 +01:00
- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (default to 0.8). Its syntax is:
2019-12-12 20:48:52 +01:00
```julia
2020-02-09 19:42:29 +01:00
findall(s, itr, dist::StringDistance; min_score = 0.8)
2019-12-12 20:48:52 +01:00
```
2019-08-20 18:32:52 +02:00
2019-12-12 20:48:52 +01:00
The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
2019-08-20 19:24:29 +02:00
2015-11-05 16:51:32 +01:00
## References
2015-11-05 19:02:50 +01:00
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
2018-08-19 01:44:10 +02:00
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
2015-11-05 16:51:32 +01:00