From f1b5671a635bf400c32c20653bbfb226d9e62fb7 Mon Sep 17 00:00:00 2001
From: matthieugomez
Date: Thu, 5 Nov 2015 17:17:57 -0500
Subject: [PATCH] advice

---
 README.md | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index e9ee800..ec51976 100644
--- a/README.md
+++ b/README.md
@@ -111,18 +111,21 @@ The package defines a number of ways to modify string metrics:
 ## Tips
 
 - Each distance is tailored to a specific problem. Edit distances works well with local spelling errors, the Ratcliff-Obsershelp distance works well with edited texts, the Jaro Winkler distance was invented for short strings such as person names, the QGrams distances works well with strings composed of multiple words with fluctuating orderings.
-- When comparing company or individual names, each string is composed of multiple words and their ordering is mostly irrelevant. Edit distances will perform poorly in this situation. Use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort` or `TokenSet`, which reorder the words alphabetically.
-
-  ```julia
-  compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
-  #> 0.44444
-  compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
-  #> 1.0
-  compare(Cosine(3), "mariners vs angels", "angels vs mariners")
-  #> 0.8125
-  ```
-
 - Standardize strings before comparing them (lowercase, punctuation, whitespaces, accents, abbreviations...)
+- Most distances will perform poorly when comparing company or individual names, where each string is composed of multiple words.
+
+    - While word ordering is mostly irrelevant in this situation, edit distances heavily penalize different word orders. Instead, use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort`, which reorders the words alphabetically.
+
+    ```julia
+    compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
+    #> 0.44444
+    compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
+    #> 1.0
+    compare(Cosine(3), "mariners vs angels", "angels vs mariners")
+    #> 0.8125
+    ```
+    - General words (like "bank", "company") may appear in one string but not the other. One solution is to abbreviate these common names first to diminish their importance (i.e. "bk", "co"). Another solution is to use something like the `Partial` or `TokenSet` modifiers.
+