pull/3/head v0.1.1
matthieugomez 2016-08-31 19:55:29 -04:00
parent d0e10ea9ff
commit 90f6865120
1 changed files with 10 additions and 7 deletions


@@ -40,26 +40,23 @@ compare(QGram(2), "martha", "marhta")
To return the *literal distance* between two strings, use `evaluate`.
## Modifiers
The package includes distance modifiers:
The package includes distance "modifiers" that can be applied to any distance. Read below for more details.
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111
The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
- The Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` similarity score. This package replicates them and extends them to any string distance:
- Modifiers from the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the maximal similarity score between the shorter string and substrings of the longer string.
@@ -96,7 +93,13 @@ The package includes distance modifiers:
#> 0.855
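The Winkler figures quoted earlier in this section can be reproduced by hand, which may help clarify what the modifier does to an arbitrary similarity score. The sketch below is not the package's implementation: `winkler_boost` is a hypothetical helper, and the scaling factor `p = 0.1` and the prefix cap of 4 characters are assumptions taken from the usual Jaro-Winkler definition. They do happen to match the numbers shown above.

```julia
# Hypothetical helper illustrating the Winkler adjustment on any similarity
# score in [0, 1]. Not part of the package API.
# Assumed constants: scaling factor p = 0.1, common prefix capped at 4.
function winkler_boost(score::Real, s1::AbstractString, s2::AbstractString;
                       p::Real = 0.1, maxprefix::Integer = 4)
    # Count the leading characters shared by both strings, up to maxprefix.
    l = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && l < maxprefix) || break
        l += 1
    end
    # Move the score toward 1 in proportion to the shared prefix length.
    return score + l * p * (1 - score)
end

# Jaro score for "martha" / "marhta" as quoted above (shared prefix "mar").
winkler_boost(0.9444444444444445, "martha", "marhta")    # roughly 0.9611

# QGram(2) score for "william" / "williams" (shared prefix capped at 4).
winkler_boost(0.9230769230769231, "william", "williams") # roughly 0.9538
```

Because the adjustment only needs a score and the two strings, nothing in it is specific to Jaro, which is consistent with the modifier being applicable to any distance.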
## Which distance should I use?
It depends on your specific problem. As a rule of thumb,
- standardize strings before comparing them (lowercase, punctuation, whitespace, accents, abbreviations...)
- if the order of words does not matter, avoid edit distances.
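As a rough illustration of the first rule of thumb, the sketch below standardizes a string using only Julia's standard library (written for Julia 1.x). `standardize` is a hypothetical helper, not something this package provides, and the choices it makes (replacing punctuation with spaces, stripping all diacritical marks) are assumptions that may not suit every dataset. Expanding abbreviations is domain-specific and is left out.

```julia
using Unicode

# Hypothetical helper, not part of the package: one possible way to
# standardize strings before comparing them.
function standardize(s::AbstractString)
    # Case-fold (lowercase) and strip diacritical marks such as accents.
    t = Unicode.normalize(s, casefold = true, stripmark = true)
    # Replace runs of punctuation with a single space.
    t = replace(t, r"\p{P}+" => " ")
    # Collapse repeated whitespace and trim both ends.
    return join(split(t), " ")
end

a = standardize("  Café-Société,  Inc. ")
b = standardize("cafe societe inc")
a == b  # both inputs reduce to the same standardized form
```

The standardized strings can then be passed to `compare` with whichever distance fits the problem.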
## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf), Mark P.J. van der Loo
- [fuzzywuzzy blog post](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)