StringDistances.jl/README.md

[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)

This Julia package computes various distances between `AbstractString`s.

## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.

## Compare
The function `compare` returns a similarity score between two strings. The score is always between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are identical. Its syntax is:

```julia
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
```

- Edit Distances
	- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
	- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
	- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
	- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
	- QGram Distance `QGram(q::Int)`
	- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q::Int)`
	- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q::Int)`
	- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q::Int)`
	- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int)`

- The package includes distance "modifiers" that can be applied to any distance.

	- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
	- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.
	- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
	- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
	- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.
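
To make the idea behind `TokenSort` concrete, here is a minimal sketch of the reordering step in plain Julia. The helper `token_sort` is hypothetical and is not part of the package; it assumes that words are separated by whitespace, which may differ from how the package tokenizes strings.

```julia
# Hypothetical helper, not the package's implementation: normalize word order
# by splitting on whitespace, sorting the tokens, and joining them again.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

a = token_sort("New York Mets")
b = token_sort("Mets New York")
println(a)        # Mets New York
println(a == b)   # true
```

Once both strings are reordered this way, any base distance sees them as identical even though the original word orders differ.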

Some examples:
```julia
compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("martha", "marhta", QGram(2))
compare("martha", "marhta", Winkler(QGram(2)))
compare("martha", "marhta", Levenshtein())
compare("martha", "marhta", Partial(Levenshtein()))
compare("martha", "marhta", Jaro())
compare("martha", "marhta", TokenSet(Jaro()))
compare("martha", "marhta", TokenMax(RatcliffObershelp()))
```
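
The q-gram family can also be illustrated without the package. The sketch below builds the set of substrings of length `q` and computes a Jaccard distance from it. The names `qgram_set` and `jaccard_distance` are hypothetical, and the package's own implementation may differ (for example in how repeated q-grams are counted).

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Collect the set of all substrings of length q (counted in characters).
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_distance(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0   # two strings without q-grams
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "martha" and "marhta" share the bigrams "ma" and "ar" out of 8 distinct ones.
println(jaccard_distance("martha", "marhta", 2))   # 0.75
```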

A good distance to match strings composed of multiple words (like addresses) is `TokenMax(Levenshtein())` (see [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)).

## Find
- `findmax` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
	```julia
	findmax(s::AbstractString, itr, dist::StringDistance; min_score = 0.0)
	```

- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (defaults to 0.8). Its syntax is:
	```julia
	findall(s::AbstractString, itr, dist::StringDistance; min_score = 0.8)
	```

The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).

## Evaluate
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning identical. Some distances are between 0 and 1, while others are unbounded.

```julia
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
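
The relationship between a raw distance and a similarity score can be sketched in plain Julia. The functions `levenshtein` and `similarity` below are hypothetical helpers, and the normalization (dividing by the length of the longer string) is an assumption made for illustration; the package may normalize differently.

```julia
# Hypothetical helper: classic dynamic-programming Levenshtein distance.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))              # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1,   # deletion
                            curr[j] + 1,     # insertion
                            prev[j] + cost)  # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: divide by the longer length, then flip to a similarity.
function similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    return n == 0 ? 1.0 : 1 - levenshtein(s1, s2) / n
end

println(levenshtein("martha", "marhta"))     # 2 (unbounded distance)
println(similarity("New York", "New York"))  # 1.0 (score between 0 and 1)
```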

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)