StringDistances.jl

matthieugomez 99d77a585b comments		2015-11-05 10:46:48 -05:00

This Julia package computes various distances between strings.

Distances

Edit Distances

Hamming Distance
Jaro Distance
Levenshtein Distance
Damerau-Levenshtein Distance
RatcliffObershelp Distance (similar to the Python library difflib)

Q-Gram Distances

QGram Distance
Cosine Distance
Jaccard Distance

A good reference for q-gram distances is the article written for the R package stringdist: "The stringdist Package for Approximate String Matching" by Mark P.J. van der Loo.
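
As a rough illustration of what the three q-gram distances measure, here is a minimal pure-Julia sketch based on the textbook definitions. The helper names (`qgram_counts`, `qgram_distance`, `cosine_distance`, `jaccard_distance`) are hypothetical and are not the package's internals; the sketch also assumes each string has at least q characters.

```julia
# Count every overlapping q-gram (substring of q characters) in a string.
function qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)                       # index by character, not by byte
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        gram = join(chars[i:i+q-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Q-gram distance: sum of absolute differences between q-gram counts.
function qgram_distance(a, b, q)
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    grams = union(Set(keys(ca)), keys(cb))
    return sum(Int[abs(get(ca, g, 0) - get(cb, g, 0)) for g in grams])
end

# Cosine distance: 1 minus the cosine of the angle between the count vectors.
function cosine_distance(a, b, q)
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(Int[n * get(cb, g, 0) for (g, n) in ca])
    norm2a = sum(Int[n^2 for n in values(ca)])
    norm2b = sum(Int[n^2 for n in values(cb)])
    return 1 - dot / sqrt(norm2a * norm2b)
end

# Jaccard distance: 1 minus the share of distinct q-grams common to both strings.
function jaccard_distance(a, b, q)
    sa = Set(keys(qgram_counts(a, q)))
    sb = Set(keys(qgram_counts(b, q)))
    return 1 - length(intersect(sa, sb)) / length(union(sa, sb))
end

println(qgram_distance("martha", "marhta", 2))   # 6
```

With q = 2, "martha" and "marhta" share only the bigrams "ma" and "ar", so six bigrams are unmatched.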

Syntax

evaluate

The function evaluate returns the literal distance between two strings; a value of 0 means the strings are identical. While some distances are bounded by 1, others, such as Hamming, Levenshtein, Damerau-Levenshtein, and QGram, can be higher than 1.

using StringDistances
evaluate(Hamming(), "martha", "marhta")
#> 2
evaluate(QGram(2), "martha", "marhta")
#> 6

compare

The higher-level function compare computes, for any distance, a similarity score between 0 and 1. A value of 0 means the strings are completely different, and a value of 1 means they are identical.

using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4
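
The scores above are consistent with dividing the raw distance by its largest possible value and subtracting the result from 1. The sketch below illustrates that idea for the Hamming distance. It is an assumption about the normalization, not a description of the package's code, and the names `hamming`, `normalized_similarity` and `hamming_similarity` are hypothetical.

```julia
# Hamming distance: number of positions at which two equal-length strings differ.
# Requiring equal lengths is an assumption of this sketch.
function hamming(a::AbstractString, b::AbstractString)
    length(a) == length(b) || throw(ArgumentError("strings must have equal length"))
    return count(x != y for (x, y) in zip(a, b))
end

# Turn a distance into a similarity in [0, 1], given the largest distance possible.
function normalized_similarity(distance, max_distance)
    max_distance == 0 && return 1.0          # two empty strings are identical
    return 1 - distance / max_distance
end

# For Hamming, the largest possible distance is the string length.
hamming_similarity(a, b) = normalized_similarity(hamming(a, b), length(a))

println(hamming_similarity("martha", "marhta"))
```

The same reasoning reproduces the QGram(2) score: each string has 5 bigrams, so the largest possible distance is 10, and a distance of 6 gives 1 - 6/10 = 0.4.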

Modifiers

The package defines a number of types to modify string metrics:

Winkler boosts the similarity score of strings with common prefixes

compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111

The Winkler adjustment was originally defined for the Jaro distance, but this package defines it for any string distance.

compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
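
Both pairs of scores above are consistent with the usual Winkler formula, score + l * p * (1 - score), where l is the length of the common prefix (capped at 4 characters) and p is a scaling factor of 0.1. The sketch below applies that formula to an already computed similarity score. The cap and the factor are the conventional values and are assumed here rather than taken from the package source; the function names are hypothetical.

```julia
# Length of the common prefix of two strings, capped at `maxlen` characters.
function common_prefix_length(a::AbstractString, b::AbstractString; maxlen = 4)
    n = 0
    for (x, y) in zip(a, b)
        (x == y && n < maxlen) || break
        n += 1
    end
    return n
end

# Winkler adjustment applied to any similarity score in [0, 1].
function winkler(score, a, b; p = 0.1, maxlen = 4)
    l = common_prefix_length(a, b; maxlen = maxlen)
    return score + l * p * (1 - score)
end

# "martha" and "marhta" share the prefix "mar" (l = 3).
println(winkler(0.9444444444444445, "martha", "marhta"))
# "william" and "williams" share 7 characters, capped at l = 4.
println(winkler(0.9230769230769231, "william", "williams"))
```

Because the boost is proportional to 1 - score, the adjusted score never exceeds 1 as long as l * p stays at or below 1.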

For strings composed of several words, the Python library fuzzywuzzy defines a few modifiers for the RatcliffObershelp distance. This package defines them for any string distance:
- Partial adjusts for differences in string lengths. The function returns the maximal similarity score between the shorter string and all substrings of the longer string.
```
compare(Partial(Hamming()), "New York Yankees", "Yankees")
#> 1.0
```
- TokenSort adjusts for differences in word order by reordering words alphabetically.
```
compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
#> 1.0
```
- TokenSet adjusts for differences in word orders and word numbers.
```
compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners")
```
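
The three modifiers can be sketched as wrappers around any similarity function `sim(a, b)` that returns a score in [0, 1]. This is a simplified illustration, not the package's implementation: `positional_similarity` is a toy stand-in for a real metric, `partial` only considers windows of the longer string that have the length of the shorter one, and `token_set` follows the fuzzywuzzy recipe as commonly described (compare the sorted common tokens with each string's sorted token set and keep the best score). All function names are hypothetical.

```julia
# Toy stand-in for a similarity score in [0, 1] (hypothetical, not a package metric):
# the share of positions holding the same character, relative to the longer string.
function positional_similarity(a::AbstractString, b::AbstractString)
    longest = max(length(a), length(b))
    longest == 0 && return 1.0
    return count(x == y for (x, y) in zip(a, b)) / longest
end

# Partial: best score between the shorter string and every window of the
# same length taken from the longer string.
function partial(sim, a::AbstractString, b::AbstractString)
    shorter, longer = length(a) <= length(b) ? (collect(a), collect(b)) : (collect(b), collect(a))
    n = length(shorter)
    s = join(shorter)
    return maximum(sim(s, join(longer[i:i+n-1])) for i in 1:(length(longer) - n + 1))
end

# TokenSort: sort the words of each string alphabetically before comparing.
sorted_tokens(s) = join(sort(split(s)), " ")
token_sort(sim, a, b) = sim(sorted_tokens(a), sorted_tokens(b))

# TokenSet: compare the common words with each string's full sorted word set.
function token_set(sim, a, b)
    ta, tb = Set(split(a)), Set(split(b))
    common = join(sort(collect(intersect(ta, tb))), " ")
    with_a = strip(common * " " * join(sort(collect(setdiff(ta, tb))), " "))
    with_b = strip(common * " " * join(sort(collect(setdiff(tb, ta))), " "))
    return max(sim(common, with_a), sim(common, with_b), sim(with_a, with_b))
end

println(partial(positional_similarity, "New York Yankees", "Yankees"))
println(token_sort(positional_similarity, "mariners vs angels", "angels vs mariners"))
```

Passing the similarity function as the first argument keeps the modifiers independent of any particular metric, which mirrors the way the package lets them wrap any string distance.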