fix travis
parent eb7933aa05
commit fc5587a60c
@@ -7,7 +7,7 @@ matrix:
- julia: nightly
after_success:
- julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
notifications:
notifications:
email: false
on_success: never
on_failure: change
22
README.md
@@ -3,10 +3,12 @@

This Julia package computes various distances between `AbstractString`s

## Installation
The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.

## Syntax
The function `compare` returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar.


```julia
using StringDistances
compare("martha", "martha", Hamming())
@@ -15,8 +17,6 @@ compare("martha", "marhta", Hamming())
#> 0.6666666666666667
```



## Distances

#### Edit Distances
@@ -89,7 +89,6 @@ The package includes distance "modifiers", that can be applied to any distance.
#> 0.855
```


## Find
`find_best` returns the element of an iterator with the highest similarity score
```julia
@@ -107,16 +106,9 @@ find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_

While these functions are defined for any distance, they are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).

## Which distance should I use?

As a rule of thumb,
- Standardize strings before comparing them (case, whitespace, accents, abbreviations...)
- The distance `TokenMax(Levenshtein())` is a good choice to link names or addresses across datasets.

## Compare vs Evaluate

The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar.
In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1. Others are unbounded.
The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1, while others are unbounded.

```julia
compare("New York", "New York", Levenshtein())
@@ -125,6 +117,12 @@ evaluate(Levenshtein(), "New York", "New York")
#> 0
```

## Which distance should I use?

As a rule of thumb,
- Standardize strings before comparing them (case, whitespace, accents, abbreviations...)
- The distance `TokenMax(Levenshtein())` is a good choice to link sequences of words (addresses, names) across datasets.

## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
- [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)