
[![StringDistances](http://pkg.julialang.org/badges/StringDistances_0.7.svg)](http://pkg.julialang.org/?pkg=StringDistances)
[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)

This Julia package computes various distances between ASCII strings.

## Syntax
The function `compare` returns a similarity score between two strings. The score is always between 0 and 1: 0 means the strings are completely different and 1 means they are identical.


```julia
using StringDistances
compare(Hamming(), "martha", "martha")
#> 1.0
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
```
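
For intuition, the scores above are consistent with dividing the Hamming distance by the string length and subtracting the result from 1. The following is a minimal sketch in plain Julia, not the package's implementation: `hamming_distance` and `hamming_similarity` are hypothetical names, and the sketch assumes both strings have the same length.

```julia
# Hamming distance: number of positions at which two equal-length strings differ.
# Assumption: strings of different lengths are rejected here; the package may treat them differently.
function hamming_distance(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) || throw(ArgumentError("strings must have the same length"))
    return count(a != b for (a, b) in zip(s1, s2))
end

# Map the distance to a similarity score in [0, 1].
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    n = length(s1)
    return n == 0 ? 1.0 : 1 - hamming_distance(s1, s2) / n
end

hamming_distance("martha", "marhta")
#> 2
hamming_similarity("martha", "martha")
#> 1.0
```

Here "martha" and "marhta" differ at two of six positions, which gives 1 - 2/6, the score shown in the example above.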


## Distances

#### Edit Distances
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`


#### Q-Gram Distances
Q-gram distances compare the set of all substrings of length `q` in each string.
- QGram Distance `QGram(q)`
- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity) `Cosine(q)`
- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) `Jaccard(q)`
- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient) `Overlap(q)`
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q)`
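
To illustrate the idea behind q-gram distances, here is a minimal sketch in plain Julia that extracts the set of q-grams of each string and computes a Jaccard distance from them. The names `qgrams` and `jaccard_distance` are hypothetical, and the sketch treats q-grams as plain sets; the package's own implementation may count repeated q-grams or differ in other ways.

```julia
# Set of all substrings of length q in s (assumes q >= 1).
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:length(chars)-q+1)
end

# Jaccard distance between the q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_distance(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgrams(s1, q), qgrams(s2, q)
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

length(qgrams("william", 2))
#> 6
length(qgrams("williams", 2))
#> 7
```

With `q = 2`, "william" and "williams" share six of seven distinct q-grams, so this sketch gives a distance of 1/7.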

#### Others
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`


## Distance Modifiers
The package includes distance "modifiers" that can be applied to any distance.

- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.

	```julia
	compare(Jaro(), "martha", "marhta")
	#> 0.9444444444444445
	compare(Winkler(Jaro()), "martha", "marhta")
	#> 0.9611111111111111

	compare(QGram(2), "william", "williams")
	#> 0.9230769230769231
	compare(Winkler(QGram(2)), "william", "williams")
	#> 0.9538461538461539
	```

- Modifiers from the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).

	- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.

		```julia
		compare(Levenshtein(), "New York Yankees", "Yankees")
		#> 0.4375
		compare(Partial(Levenshtein()), "New York Yankees", "Yankees")
		#> 1.0
		```

	- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.

		```julia
		compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
		#> 0.44444
		compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
		#> 1.0
		```

	- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by comparing the intersection of the two strings with each string.

		```julia
		compare(Jaro(),"mariners vs angels", "los angeles angels at seattle mariners")
		#> 0.559904
		compare(TokenSet(Jaro()),"mariners vs angels", "los angeles angels at seattle mariners")
		#> 0.944444
		```


	- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.

		```julia
		compare(TokenMax(RatcliffObershelp()),"mariners vs angels", "los angeles angels at seattle mariners")
		#> 0.855
		```
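
Two of the modifiers above can be sketched in a few lines of plain Julia. This is an illustration under stated assumptions, not the package's code: `common_prefix_length`, `winkler_boost` and `token_sort` are hypothetical names, and the scaling factor `p = 0.1` and the prefix cap of 4 characters are the usual Jaro-Winkler defaults, assumed here because they reproduce the scores in the Winkler examples. The package may apply extra conditions, such as a minimum score before boosting.

```julia
# Length of the common prefix of two strings, capped at `maxlen` characters.
function common_prefix_length(s1::AbstractString, s2::AbstractString; maxlen::Integer = 4)
    n = 0
    for (a, b) in zip(s1, s2)
        (a == b && n < maxlen) || break
        n += 1
    end
    return n
end

# Winkler adjustment: move the score towards 1 in proportion to the common prefix length.
# Assumption: p = 0.1 and maxlen = 4.
function winkler_boost(score::Real, s1::AbstractString, s2::AbstractString;
                       p::Real = 0.1, maxlen::Integer = 4)
    return score + common_prefix_length(s1, s2; maxlen = maxlen) * p * (1 - score)
end

# TokenSort preprocessing: split on whitespace, sort the words, join them again.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

common_prefix_length("martha", "marhta")
#> 3
token_sort("mariners vs angels")
#> "angels mariners vs"
```

Applying `winkler_boost` to the Jaro score 0.9444444444444445 of "martha" and "marhta" gives approximately 0.9611, and both "mariners vs angels" and "angels vs mariners" sort to the same string, which is why the `TokenSort` score is 1.0.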

## Compare vs Evaluate
The function `compare` returns a similarity score: 0 means the strings are completely different and 1 means they are identical.

In contrast, the function `evaluate` returns the literal distance between two strings: 0 means they are identical.

```julia
compare(Levenshtein(), "New York", "New York")
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
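
The relation between the two can be sketched for the Levenshtein distance. The code below is a plain-Julia illustration with hypothetical names (`levenshtein` and `levenshtein_similarity`), not the package's implementation. It assumes the similarity is one minus the distance divided by the length of the longer string, which matches the `Levenshtein()` score shown in the `Partial` example.

```julia
# Levenshtein distance: minimal number of insertions, deletions and substitutions.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of a
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumption: normalize by the length of the longer string.
function levenshtein_similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    return n == 0 ? 1.0 : 1 - levenshtein(s1, s2) / n
end

levenshtein("New York Yankees", "Yankees")
#> 9
levenshtein_similarity("New York Yankees", "Yankees")
#> 0.4375
```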

## Which distance should I use?

As a rule of thumb:
- Standardize strings before comparing them (correct for case, punctuation, whitespace, accents, abbreviations, etc.).
- Don't use edit distances if word order does not matter.
- The distance `TokenMax(RatcliffObershelp())` is a good default choice.
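
As an example of the first point, the helper below sketches one possible standardization step using only Julia's standard library. `standardize` is a hypothetical name, not part of this package, and the choices it makes (folding case, stripping accents, replacing punctuation with spaces, collapsing whitespace) are assumptions that may not suit every dataset. It does not expand abbreviations.

```julia
using Unicode

# Fold case, strip accents, replace punctuation with spaces and collapse whitespace.
function standardize(s::AbstractString)
    s = Unicode.normalize(s, stripmark = true, casefold = true)
    s = replace(s, r"[[:punct:]]" => " ")
    return join(split(s), " ")
end

standardize("  New-York,  YANKEES! ")
#> "new york yankees"
```

Both strings would be passed through such a helper before calling `compare`.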

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy blog post](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)