[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)

This Julia package computes various distances between `AbstractString`s.

## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.

## Syntax
The function `compare` returns a similarity score between two strings. The score is always between 0 and 1: a value of 0 means the strings are completely different, and a value of 1 means they are identical.

```julia
using StringDistances
compare("martha", "martha", Hamming())
#> 1.0
compare("martha", "marhta", Hamming())
#> 0.6666666666666667
```
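The score in the second example is consistent with dividing the number of mismatched positions by the string length and subtracting the result from 1. The following is a minimal sketch of that idea in plain Julia. The name `hamming_similarity` is hypothetical and is not part of the package, and the equal-length requirement follows the textbook definition of the Hamming distance rather than anything confirmed about the package's internals.

```julia
# Hypothetical helper for illustration only; not part of StringDistances.jl.
# Assumes the textbook Hamming definition, which requires equal-length strings.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) ||
        throw(ArgumentError("strings must have equal length"))
    isempty(s1) && return 1.0
    # Count positions where the two strings hold different characters.
    mismatches = count(((a, b),) -> a != b, zip(s1, s2))
    return 1 - mismatches / length(s1)
end

hamming_similarity("martha", "martha")  # no mismatches
hamming_similarity("martha", "marhta")  # 2 mismatches out of 6 characters
```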

## Distances

#### Edit Distances
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`


#### Q-Grams Distances
Q-gram distances compare the set of all substrings of length `q` in each string.
- QGram Distance `QGram(q)`
- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q)`
- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q)`
- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q)`
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q)`
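
To make the idea of q-grams concrete, here is a small sketch in plain Julia that extracts the q-grams of a string and computes a Jaccard similarity on the resulting sets. The helpers `qgrams` and `jaccard_similarity` are hypothetical names used for illustration. They are not the package's implementation, which may, for example, count repeated q-grams rather than treat them as a set.

```julia
# Hypothetical helpers for illustration only; not part of StringDistances.jl.

# Return every substring of length q, working on characters so that
# non-ASCII strings are handled correctly.
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return [String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)]
end

# Jaccard similarity: size of the intersection over size of the union.
function jaccard_similarity(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = Set(qgrams(s1, q)), Set(qgrams(s2, q))
    isempty(a) && isempty(b) && return 1.0
    return length(intersect(a, b)) / length(union(a, b))
end

qgrams("william", 2)                          # ["wi", "il", "ll", "li", "ia", "am"]
jaccard_similarity("william", "williams", 2)  # 6 shared 2-grams out of 7 distinct
```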

## Distance Modifiers
The package includes distance "modifiers" that can be applied to any distance.

- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.

	```julia
	compare("martha", "marhta", Jaro())
	#> 0.9444444444444445
	compare("martha", "marhta", Winkler(Jaro()))
	#> 0.9611111111111111

	compare("william", "williams", QGram(2))
	#> 0.9230769230769231
	compare("william", "williams", Winkler(QGram(2)))
	#> 0.9538461538461539
	```

- Modifiers from the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) that can be applied to any distance.

	- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.

		```julia
		compare("New York Yankees", "Yankees", Levenshtein())
		#> 0.4375
		compare("New York Yankees", "Yankees", Partial(Levenshtein()))
		#> 1.0
		```

	- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.

		```julia
		compare("mariners vs angels", "angels vs mariners", RatcliffObershelp())
		#> 0.44444
		compare("mariners vs angels", "angels vs mariners", TokenSort(RatcliffObershelp()))
		#> 1.0
		```

	- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and in the number of words by comparing the intersection of the two strings with each string.

		```julia
		compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
		#> 0.559904
		compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
		#> 0.944444
		```


	- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.

		```julia
		compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))
		#> 0.855
		```
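
The preprocessing step behind `TokenSort` can be pictured as follows: split each string into words, sort the words alphabetically, and join them back together before applying the base distance. This is a sketch in plain Julia under that assumption. The helper name `sort_tokens` is hypothetical, and the package's actual tokenization rules may differ.

```julia
# Hypothetical helper for illustration only; not part of StringDistances.jl.
# Split on whitespace, sort the words, and rejoin them with single spaces.
sort_tokens(s::AbstractString) = join(sort(split(s)), " ")

sort_tokens("mariners vs angels")  # "angels mariners vs"
sort_tokens("angels vs mariners")  # "angels mariners vs"
```

Both inputs map to the same string after sorting, which is why a distance applied afterwards sees them as identical.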

## Find
`find_best` returns the element of an iterator with the highest similarity score.

```julia
find_best("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein())
#> "NewYork"
```

`find_all` returns all the elements of an iterator with a similarity score higher than a minimum value (defaults to 0.8).

```julia
find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.8)
#> 1-element Array{String,1}:
#> "NewYork"
```

While these functions are defined for any distance, they are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).

## Compare vs Evaluate

The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning identical. Some distances are between 0 and 1, while others are unbounded.

```julia
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
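
For the Levenshtein distance, the scores shown in this README are consistent with dividing the raw distance by the length of the longer string and subtracting the result from 1. The sketch below illustrates that relationship in plain Julia. The functions `levenshtein` and `levenshtein_similarity` are hypothetical stand-ins written for this example, and the normalization is an assumption inferred from the example outputs, not a description of the package's code.

```julia
# Hypothetical helpers for illustration only; not part of StringDistances.jl.

# Classic dynamic-programming Levenshtein distance, keeping two rows.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))      # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                  # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function levenshtein_similarity(s1::AbstractString, s2::AbstractString)
    maxlen = max(length(s1), length(s2))
    maxlen == 0 && return 1.0
    return 1 - levenshtein(s1, s2) / maxlen
end

levenshtein("New York Yankees", "Yankees")             # 9 deletions
levenshtein_similarity("New York Yankees", "Yankees")  # 1 - 9/16 = 0.4375
```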

## Which distance should I use?

As a rule of thumb:
- Standardize strings before comparing them (case, whitespace, accents, abbreviations...)
- The distance `TokenMax(Levenshtein())` is a good choice to link sequences of words (addresses, names) across datasets (see [`fuzzywuzzy`](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/))
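
As one possible way to standardize strings before comparing them, the sketch below folds case, strips accents, and collapses whitespace using only Julia's standard library. The helper name `standardize` is hypothetical, and the right set of normalizations (abbreviations in particular) depends on the dataset.

```julia
using Unicode

# Hypothetical helper for illustration only; not part of StringDistances.jl.
function standardize(s::AbstractString)
    # Fold case and strip diacritical marks (accents).
    s = Unicode.normalize(s, stripmark=true, casefold=true)
    # Trim the ends and collapse runs of whitespace into single spaces.
    return join(split(s), " ")
end

standardize("  Café   de  Flore ")  # "cafe de flore"
```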

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)