[![Build status](https://github.com/matthieugomez/StringDistances.jl/workflows/CI/badge.svg)](https://github.com/matthieugomez/StringDistances.jl/actions)
## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.
## Supported Distances
String distances act over any pair of iterators that define `length` (e.g. `AbstractStrings`, `GraphemeIterators`, or `AbstractVectors`).

The available distances are:
- Edit Distances
    - Hamming Distance `Hamming() <: SemiMetric`
    - [Jaro and Jaro-Winkler Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()` `JaroWinkler() <: SemiMetric`
    - [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein() <: Metric`
    - [Optimal String Alignment Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance) (a.k.a. restricted Damerau-Levenshtein) `OptimalStringAlignement() <: SemiMetric`
    - [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Distance_with_adjacent_transpositions) `DamerauLevenshtein() <: Metric`
    - [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp() <: SemiMetric`
- Q-gram distances compare the set of all substrings of length `q` in each string
    - QGram Distance `QGram(q::Int) <: SemiMetric`
    - [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q::Int) <: SemiMetric`
    - [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q::Int) <: SemiMetric`
    - [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q::Int) <: SemiMetric`
    - [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int) <: SemiMetric`
    - [MorisitaOverlap Distance](https://en.wikipedia.org/wiki/Morisita%27s_overlap_index) `MorisitaOverlap(q::Int) <: SemiMetric`
    - [Normalized Multiset Distance](https://www.sciencedirect.com/science/article/pii/S1047320313001417) `NMD(q::Int) <: SemiMetric`
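
To make the q-gram idea concrete, here is a minimal sketch in plain Julia of a set-based Jaccard distance over q-grams. The helpers `qgram_set` and `jaccard_sketch` are hypothetical names used only for illustration; they are not part of the package and may differ from how the package counts q-grams or handles edge cases.

```julia
# Hypothetical helper: the set of all length-q substrings (q-grams) of `s`.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)  # index by character so non-ASCII input is handled
    return Set{String}(String(chars[i:i+q-1]) for i in 1:length(chars)-q+1)
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0  # convention chosen for this sketch
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

jaccard_sketch("martha", "marhta", 2)  # 0.75: 2 shared bigrams out of 8 distinct
```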
## Basic Use
### distance
The distance between two strings can be computed using the following syntax:
```julia
dist(s1, s2)
```
For instance, with the `Levenshtein` distance,
```julia
Levenshtein()("martha", "marhta")
```
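
For readers unfamiliar with what this call measures, the following is a minimal sketch of the classic dynamic-programming recurrence for the Levenshtein distance, written in plain Julia. The function name `lev_sketch` is hypothetical, and the package's own implementation is likely more optimized than this.

```julia
# Hypothetical helper illustrating the textbook two-row recurrence.
function lev_sketch(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)      # work on characters, not bytes
    prev = collect(0:length(y))        # distances from the empty prefix of `a`
    for i in eachindex(x)
        curr = similar(prev)
        curr[1] = i                    # cost of deleting the first i characters of `a`
        for j in eachindex(y)
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

lev_sketch("martha", "marhta")  # 2: two substitutions, since transpositions are not a single edit
```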
### pairwise
`pairwise` returns the matrix of distances between two `AbstractVectors` of `AbstractStrings` (or iterators):
```julia
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
```
The function `pairwise` is particularly optimized for QGram-distances (each element is processed only once).
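
Conceptually, the result has one row per element of the first vector and one column per element of the second. A naive sketch of that layout, assuming nothing about the package internals, is shown below; `pairwise_sketch` and the toy distance `length_gap` are hypothetical names used only to keep the example self-contained.

```julia
# Hypothetical helper: entry (i, j) holds dist(xs[i], ys[j]).
pairwise_sketch(dist, xs, ys) = [dist(x, y) for x in xs, y in ys]

# Toy stand-in distance (difference in length), not a distance from the package.
length_gap(a, b) = abs(length(a) - length(b))

D = pairwise_sketch(length_gap, ["martha", "kitten"], ["marhta", "sitting"])
size(D)  # (2, 2)
```

Unlike this sketch, which evaluates every pair from scratch, the package's `pairwise` is described above as processing each element only once for q-gram distances.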
### fuzzywuzzy
The package also defines distance "modifiers" modeled on those in the Python package [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/). These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the distance between the shorter string and substrings of the longer string.
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) normalizes the distance and combines the `Partial`, `TokenSort`, and `TokenSet` modifiers, with penalty terms depending on string lengths. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
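
As an illustration of the word-reordering idea behind `TokenSort`, here is a minimal sketch in plain Julia. The helper `token_sort_sketch` is a hypothetical name, and the package may tokenize or normalize strings differently; the point is only that two strings with the same words in a different order become identical before any distance is applied.

```julia
# Hypothetical helper: split on whitespace, sort the words, and rejoin them.
token_sort_sketch(s::AbstractString) = join(sort(split(s)), " ")

token_sort_sketch("new york mets")                                        # "mets new york"
token_sort_sketch("new york mets") == token_sort_sketch("mets new york")  # true
```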
### find
The package also adds some convenience functions to find the element in a list that is closest to a given string:

- The function `compare` returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type `Float64`. A value of 0.0 means the strings are completely different and a value of 1.0 means they are identical.
```julia
compare("martha", "martha", Levenshtein())
#> 1.0
```
- `findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
```julia
findnearest(s, itr, dist::StringDistance)
```
- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (defaults to 0.8). Its syntax is:
```julia
findall(s, itr, dist::StringDistance; min_score = 0.8)
```
The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `OptimalStringAlignement` distances (these distances stop early if the distance is higher than a certain threshold).
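
The behavior of these two lookups can be sketched in plain Julia as a search over similarity scores. Everything in this block is an assumption made for illustration: `char_overlap` is a toy score standing in for `compare`, and `nearest_sketch` and `above_sketch` are hypothetical helpers that do not include the early-stopping optimization mentioned above.

```julia
# Toy similarity in [0, 1] (overlap of character sets), standing in for `compare`.
char_overlap(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

# Hypothetical helper: value and index of the best-scoring candidate.
function nearest_sketch(s, candidates, score)
    scores = [score(s, c) for c in candidates]
    _, idx = findmax(scores)
    return candidates[idx], idx
end

# Hypothetical helper: indices of all candidates scoring above `min_score`.
above_sketch(s, candidates, score; min_score = 0.8) =
    [i for (i, c) in enumerate(candidates) if score(s, c) > min_score]

cities = ["NewYork", "Newark", "San Francisco"]
nearest_sketch("New York", cities, char_overlap)  # ("NewYork", 1)
above_sketch("New York", cities, char_overlap)    # [1]
```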
## Notes
- All string lookups are case sensitive.