pull/3/head v0.1.1
matthieugomez 2016-08-31 19:55:29 -04:00
parent d0e10ea9ff
commit 90f6865120
1 changed files with 10 additions and 7 deletions


@ -40,26 +40,23 @@ compare(QGram(2), "martha", "marhta")
To return the *literal distance* between two strings, use `evaluate`.
## Modifiers
The package includes distance modifiers:
The package includes distance "modifiers" that can be applied to any distance. See below for details.
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
```julia
compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111
```
The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
```julia
compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
```
- The Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` similarity score. This package replicates them and extends them to any string distance:
- Modifiers from the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.
@ -96,7 +93,13 @@ The package includes distance modifiers:
#> 0.855
```
```
## Which distance should I use?
It depends on your specific problem. As a rule of thumb,
- standardize strings before comparing them (case, punctuation, whitespace, accents, abbreviations...)
- if the order of words does not matter, avoid edit distances.
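
The two rules of thumb above can be sketched as a small preprocessing step. This is only an illustration, not part of the package: `standardize` and `sort_words` are hypothetical helper names, the code uses only Julia's `Unicode` standard library, and it assumes a Julia version that provides `Unicode.normalize` (0.7 or later, newer than this commit).

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical helper (not part of the package): standardize a string
# before comparing it with another one.
function standardize(s::AbstractString)
    # Case-fold and strip accents (combining marks).
    s = Unicode.normalize(s, casefold=true, stripmark=true)
    # Replace punctuation characters with spaces.
    s = replace(s, r"[[:punct:]]" => " ")
    # Collapse runs of whitespace and trim both ends.
    return join(split(s), " ")
end

# Hypothetical helper: sort the words so that word order no longer matters.
sort_words(s::AbstractString) = join(sort(split(standardize(s))), " ")

# Strings that differ only in case, punctuation, or spacing become identical.
standardize("  New-York   Mets! ") == standardize("new york mets")

# Strings that differ only in word order become identical after sorting.
sort_words("Mets, New York") == sort_words("new york mets")
```

The standardized strings can then be passed to `compare` or `evaluate` in place of the raw inputs. Expanding abbreviations is domain-specific and is left out of this sketch.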
## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy blog post](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)