From fc5587a60c007d143517c4ef6c49ba670197b177 Mon Sep 17 00:00:00 2001
From: matthieugomez
Date: Wed, 11 Dec 2019 16:12:24 -0500
Subject: [PATCH] fix travis

---
 .travis.yml |  2 +-
 README.md   | 22 ++++++++++------------
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index 832ffbf..162a4d8 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -7,7 +7,7 @@ matrix:
     - julia: nightly
 after_success:
   - julia -e 'using Pkg; Pkg.add("Coverage"); using Coverage; Coveralls.submit(Coveralls.process_folder())'
- notifications:
+notifications:
   email: false
   on_success: never
   on_failure: change
\ No newline at end of file
diff --git a/README.md b/README.md
index 92a77b0..da88de0 100644
--- a/README.md
+++ b/README.md
@@ -3,10 +3,12 @@
 
 This Julia package computes various distances between `AbstractString`s
 
+## Installation
+The package is registered in the [`General`](https://github.com/JuliaRegistries/General) registry and so can be installed at the REPL with `] add StringDistances`.
+
 ## Syntax
 The function `compare` returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar.
 
-
 ```julia
 using StringDistances
 compare("martha", "martha", Hamming())
@@ -15,8 +17,6 @@ compare("martha", "marhta", Hamming())
 #> 0.6666666666666667
 ```
 
-
-
 ## Distances
 
 #### Edit Distances
@@ -89,7 +89,6 @@ The package includes distance "modifiers", that can be applied to any distance.
 #> 0.855
 ```
 
-
 ## Find
 `find_best` returns the element of an iterator with the highest similarity score
 ```julia
@@ -107,16 +106,9 @@ find_all("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_
 
 While these functions are defined for any distance, they are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`)
 
-## Which distance should I use?
-
-As a rule of thumb,
-- Standardize strings before comparing them (cases, whitespaces, accents, abbreviations...)
-- The distance `Tokenmax(Levenshtein())` is a good choice to link names or adresses across datasets.
-
 ## Compare vs Evaluate
 
-The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar.
-In contrast, the function `evaluate` returns the litteral distance between two strings, with a value of 0 being completely similar. some distances are between 0 and 1. Others are unbouded.
+The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function `evaluate` returns the litteral distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1, while others are unbouded.
 
 ```julia
 compare("New York", "New York", Levenshtein())
@@ -125,6 +117,12 @@ evaluate(Levenshtein(), "New York", "New York")
 #> 0
 ```
 
+## Which distance should I use?
+
+As a rule of thumb,
+- Standardize strings before comparing them (cases, whitespaces, accents, abbreviations...)
+- The distance `Tokenmax(Levenshtein())` is a good choice to link sequence of words (adresses, names) across datasets.
+
 ## References
 - [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf) Mark P.J. van der Loo
 - [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)