
[![Build Status](https://travis-ci.org/matthieugomez/StringDistances.jl.svg?branch=master)](https://travis-ci.org/matthieugomez/StringDistances.jl)
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
[![StringDistances](http://pkg.julialang.org/badges/StringDistances_0.4.svg)](http://pkg.julialang.org/?pkg=StringDistances)

This Julia package computes various distances between strings.


## Distances

#### Edit Distances
- [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance)
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)

#### Q-Grams Distances
Q-gram distances compare the sets of all substrings of length `q` in each string:
- QGram Distance
- [Cosine Distance](https://en.wikipedia.org/wiki/Cosine_similarity)
- [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index)
- [Overlap Distance](https://en.wikipedia.org/wiki/Overlap_coefficient)
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
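
To make the q-gram idea concrete, here is a minimal sketch in plain Julia (1.x). It is not the package's implementation, and the function names `qgram_counts`, `qgram_distance` and `jaccard_distance` are hypothetical helpers introduced only for illustration. It assumes each string has at least `q` characters.

```julia
# Count every substring of length q in s (its "q-gram profile").
function qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)                       # index by character, not by byte
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        gram = String(chars[i:i+q-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Q-gram distance: sum of absolute differences between the two profiles.
function qgram_distance(s1, s2, q)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    total = 0
    for gram in union(Set(keys(c1)), Set(keys(c2)))
        total += abs(get(c1, gram, 0) - get(c2, gram, 0))
    end
    return total
end

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B| on the sets of distinct q-grams.
function jaccard_distance(s1, s2, q)
    a = Set(keys(qgram_counts(s1, q)))
    b = Set(keys(qgram_counts(s2, q)))
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

qgram_distance("martha", "marhta", 2)     # 6: only "ma" and "ar" are shared
jaccard_distance("martha", "marhta", 2)   # 0.75: 2 shared bigrams out of 8 distinct
```

The count-based distance can grow with the length of the strings, while the set-based Jaccard distance always stays between 0 and 1.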

#### Others
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) is based on the length of matching subsequences. It is used in the Python library [difflib](https://docs.python.org/2/library/difflib.html).
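
The "matching subsequences" idea behind Ratcliff/Obershelp can be sketched as follows: find the longest common substring, then recurse on the pieces to its left and right, and score `2 * matches / total length`. This is a textbook-style sketch in plain Julia, not the package's code and not a reimplementation of `difflib` (which adds its own heuristics). All function names are hypothetical.

```julia
# Longest common substring of a and b, found by dynamic programming.
# Returns (start_in_a, start_in_b, length); length is 0 if nothing matches.
function longest_common_substring(a::Vector{Char}, b::Vector{Char})
    best_i, best_j, best_len = 0, 0, 0
    prev = zeros(Int, length(b) + 1)    # prev[j] = match length ending at a[i-1], b[j-1]
    for i in eachindex(a)
        curr = zeros(Int, length(b) + 1)
        for j in eachindex(b)
            if a[i] == b[j]
                curr[j+1] = prev[j] + 1
                if curr[j+1] > best_len
                    best_len = curr[j+1]
                    best_i = i - best_len + 1
                    best_j = j - best_len + 1
                end
            end
        end
        prev = curr
    end
    return best_i, best_j, best_len
end

# Number of matching characters: the longest common substring, plus the
# matches found recursively to its left and to its right.
function matching_chars(a::Vector{Char}, b::Vector{Char})
    (isempty(a) || isempty(b)) && return 0
    i, j, len = longest_common_substring(a, b)
    len == 0 && return 0
    left = matching_chars(a[1:i-1], b[1:j-1])
    right = matching_chars(a[i+len:end], b[j+len:end])
    return len + left + right
end

# Similarity in [0, 1]: twice the matches over the total number of characters.
function ratcliff_obershelp_similarity(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    total = length(a) + length(b)
    return total == 0 ? 1.0 : 2 * matching_chars(a, b) / total
end

# "mariners" (8 characters) is the only match that survives the recursion,
# so the score is 2 * 8 / 36 ≈ 0.444.
ratcliff_obershelp_similarity("mariners vs angels", "angels vs mariners")
```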

## Syntax
#### evaluate
The function `evaluate` returns the literal distance between two strings (a value of 0 means the strings are identical). While some distances are bounded by 1, others, such as `Hamming`, `Levenshtein`, `DamerauLevenshtein` and `QGram`, can be higher than 1.

```julia
using StringDistances
evaluate(Hamming(), "martha", "marhta")
#> 2
evaluate(QGram(2), "martha", "marhta")
#> 6
```

#### compare
The higher-level function `compare` computes, for any distance, a similarity score between 0 and 1. A value of 0 means the strings are completely different and a value of 1 means they are completely similar.
```julia
using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4
```
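
The scores above are consistent with dividing the raw distance by the largest value it can take and subtracting the result from 1. The sketch below illustrates that relationship in plain Julia; it is an assumption inferred from the example outputs, not a description of the package's internals, and `hamming` and `similarity` are hypothetical helper names.

```julia
# Hamming distance: number of positions at which two equal-length strings differ.
# Requiring equal lengths is a choice made for this sketch.
function hamming(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    length(a) == length(b) || throw(ArgumentError("strings must have the same length"))
    return count(a .!= b)
end

# Map a raw distance to a similarity in [0, 1], given the largest possible distance.
similarity(distance, max_distance) = max_distance == 0 ? 1.0 : 1 - distance / max_distance

# Hamming: at most one mismatch per character, so the maximum is the string length.
similarity(hamming("martha", "marhta"), 6)    # 1 - 2/6 ≈ 0.667

# QGram(2): each 6-character string has 5 bigrams, so the maximum is 5 + 5 = 10.
similarity(6, 10)                             # 1 - 6/10 ≈ 0.4
```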


## Modifiers

The package defines a number of ways to modify string metrics:

- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes.

	```julia
	compare(Jaro(), "martha", "marhta")
	#> 0.9444444444444445
	compare(Winkler(Jaro()), "martha", "marhta")
	#> 0.9611111111111111
	```
	The Winkler adjustment was originally defined for the Jaro distance, but this package defines it for any string distance.

	```julia
	compare(QGram(2), "william", "williams")
	#> 0.9230769230769231
	compare(Winkler(QGram(2)), "william", "williams")
	#> 0.9538461538461539
	```

- For strings composed of several words, the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` distance. This package defines them for any string distance:

	- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in string lengths. The function returns the maximal similarity score between the shorter string and all substrings of the longer string. 	

		```julia
		compare(Levenshtein(), "New York Yankees", "Yankees")
		#> 0.4375
		compare(Partial(Levenshtein()), "New York Yankees", "Yankees")
		#> 1.0
		```

	- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.

		```julia
		compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
		#> 0.44444
		compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
		#> 1.0
		```

	- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and in the number of words.

		```julia
		compare(Jaro(),"mariners vs angels", "los angeles angels at seattle mariners")
		#> 0.559904
		compare(TokenSet(Jaro()),"mariners vs angels", "los angeles angels at seattle mariners")
		#> 0.944444
		```
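
The Winkler examples above agree with the usual formula `score + l * p * (1 - score)`, where `l` is the length of the common prefix (capped at 4) and `p` is a scaling factor of 0.1. The following plain-Julia sketch assumes those conventional constants; the package may use different defaults, and `common_prefix_length` and `winkler_boost` are hypothetical names.

```julia
# Length of the common prefix of two strings, in characters.
function common_prefix_length(s1::AbstractString, s2::AbstractString)
    n = 0
    for (c1, c2) in zip(s1, s2)
        c1 == c2 || break
        n += 1
    end
    return n
end

# Winkler adjustment: move the score toward 1 in proportion to the shared prefix.
# `scaling = 0.1` and `max_prefix = 4` are the conventional values, assumed here.
function winkler_boost(score, s1, s2; scaling = 0.1, max_prefix = 4)
    l = min(common_prefix_length(s1, s2), max_prefix)
    return score + l * scaling * (1 - score)
end

# Starting from the unmodified scores shown in this README:
winkler_boost(0.9444444444444445, "martha", "marhta")      # ≈ 0.9611 (prefix "mar")
winkler_boost(0.9230769230769231, "william", "williams")   # ≈ 0.9538 (prefix capped at 4)
```

Because the boost only depends on the incoming score and the two strings, it can wrap any similarity, which is why it is not tied to the Jaro distance.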


You can compose multiple modifiers:
```julia
compare(Winkler(Partial(Jaro())),"mariners vs angels", "los angeles angels at seattle mariners")
#> 0.7378917378917379
compare(TokenSet(Partial(RatcliffObershelp())),"mariners vs angels", "los angeles angels at seattle mariners")
#> 1.0
```
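
One way to see why modifiers compose is to treat each one as a function that takes a similarity and returns a new similarity. The sketch below does this in plain Julia with a deliberately simple base similarity. It is an illustration under stated assumptions, not the package's design: `positional_similarity`, `partial` and `token_sort` are hypothetical names, and `partial` only considers windows of the longer string that have the same length as the shorter one.

```julia
# A simple base similarity: share of positions that agree between two
# strings of equal length (a normalized Hamming similarity).
function positional_similarity(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    length(a) == length(b) || throw(ArgumentError("equal lengths required"))
    return isempty(a) ? 1.0 : count(a .== b) / length(a)
end

# `partial(sim)` returns a new similarity: the best score between the shorter
# string and every same-length window of the longer string.
function partial(sim)
    return function (s1, s2)
        short, long = sort([collect(s1), collect(s2)], by = length)
        n = length(short)
        windows = 1:(length(long) - n + 1)
        return maximum(sim(String(short), String(long[i:i+n-1])) for i in windows)
    end
end

# `token_sort(sim)` returns a new similarity that ignores word order by
# sorting the whitespace-separated words of each string first.
token_sort(sim) = (s1, s2) -> sim(join(sort(split(s1)), " "), join(sort(split(s2)), " "))

# Modifiers nest like ordinary function calls.
combined = token_sort(partial(positional_similarity))
combined("mariners vs angels", "angels vs mariners")             # 1.0
partial(positional_similarity)("New York Yankees", "Yankees")    # 1.0
```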

## References
A good reference for some distances in this package is the article written for the R package `stringdist`:
*The stringdist Package for Approximate String Matching*, Mark P.J. van der Loo.