pull/3/head
matthieugomez 2015-11-05 17:17:57 -05:00
parent 1f4560f18b
commit f1b5671a63
1 changed files with 14 additions and 11 deletions

@@ -111,18 +111,21 @@ The package defines a number of ways to modify string metrics:
## Tips
- Each distance is tailored to a specific problem. Edit distances work well with local spelling errors; the Ratcliff-Obershelp distance works well with edited texts; the Jaro-Winkler distance was invented for short strings such as person names; the QGram distances work well with strings composed of multiple words with fluctuating orderings.
- When comparing company or individual names, each string is composed of multiple words and their ordering is mostly irrelevant. Edit distances will perform poorly in this situation. Use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort` or `TokenSet`, which reorder the words alphabetically.
```julia
compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
compare(Cosine(3), "mariners vs angels", "angels vs mariners")
#> 0.8125
```
- Standardize strings before comparing them (lowercase, punctuation, whitespace, accents, abbreviations, etc.).
- Most distances will perform poorly when comparing company or individual names, where each string is composed of multiple words.
- While word ordering is mostly irrelevant in this situation, edit distances heavily penalize different word orders. Instead, use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort`, which reorders the words alphabetically.
```julia
compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
    compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
compare(Cosine(3), "mariners vs angels", "angels vs mariners")
#> 0.8125
```
- General words (like "bank", "company") may appear in one string but not the other. One solution is to abbreviate these common words first to diminish their importance (i.e., "bk", "co"). Another solution is to use something like the `Partial` or `TokenSet` modifiers.
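
The standardization and word-reordering tips above can be sketched with Julia's standard library alone. This is a minimal illustration, not part of the package: `standardize`, `tokensort`, and the `ABBREVIATIONS` table are hypothetical names introduced here, and the code assumes Julia 1.x (for the `Unicode` standard library and the `pattern => replacement` form of `replace`).

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical abbreviation table: shortens general words so they weigh less.
const ABBREVIATIONS = Dict("bank" => "bk", "company" => "co")

# Hypothetical helper: lowercase, strip accents and punctuation,
# collapse whitespace, and abbreviate common words.
function standardize(s::AbstractString)
    s = lowercase(s)
    s = Unicode.normalize(s, stripmark = true)   # "é" becomes "e"
    s = replace(s, r"[[:punct:]]" => " ")        # punctuation becomes a space
    words = split(s)                             # splits on runs of whitespace
    return join((get(ABBREVIATIONS, w, w) for w in words), " ")
end

# Hypothetical helper: reorder words alphabetically, which is the idea
# the text attributes to `TokenSort`.
tokensort(s::AbstractString) = join(sort(split(s)), " ")

standardize("Bank of América, Inc.")
#> "bk of america inc"
tokensort("mariners vs angels") == tokensort("angels vs mariners")
#> true
```

Strings preprocessed this way can then be passed to any distance; the exact rules (which words to abbreviate, whether to drop punctuation entirely) depend on the data being matched.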