advice
parent
1f4560f18b
commit
f1b5671a63
25
README.md
@@ -111,18 +111,21 @@ The package defines a number of ways to modify string metrics:

## Tips

- Each distance is tailored to a specific problem. Edit distances work well with local spelling errors, the Ratcliff-Obershelp distance works well with edited texts, the Jaro-Winkler distance was invented for short strings such as person names, and the QGram distances work well with strings composed of multiple words with fluctuating orderings.
- When comparing company or individual names, each string is composed of multiple words and their ordering is mostly irrelevant. Edit distances perform poorly in this situation. Use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort` or `TokenSet`, which reorder the words alphabetically.

```julia
compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
compare(Cosine(3), "mariners vs angels", "angels vs mariners")
#> 0.8125
```

- Standardize strings before comparing them (case, punctuation, whitespace, accents, abbreviations, ...).
- Most distances will perform poorly when comparing company or individual names, where each string is composed of multiple words.
- While word ordering is mostly irrelevant in this situation, edit distances heavily penalize different word orders. Instead, use either a distance robust to word order (like QGram distances), or compose a distance with `TokenSort`, which reorders the words alphabetically.

```julia
compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
#> 0.44444
compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
#> 1.0
compare(Cosine(3), "mariners vs angels", "angels vs mariners")
#> 0.8125
```
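
To see why sorting words removes the order penalty, here is a minimal sketch of the idea in plain Julia. `sort_tokens` is a hypothetical helper written for illustration only; it is not part of the package, and the actual `TokenSort` implementation may differ.

```julia
# Hypothetical helper (not the package's code): split on whitespace,
# sort the words alphabetically, and join them back with single spaces.
sort_tokens(s::AbstractString) = join(sort(split(s)), " ")

a = sort_tokens("mariners vs angels")   # "angels mariners vs"
b = sort_tokens("angels vs mariners")   # "angels mariners vs"

# Both strings collapse to the same form, so any distance applied
# afterwards sees two identical inputs.
a == b   # true
```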

- General words (like "bank" or "company") may appear in one string but not the other. One solution is to abbreviate these common words first to diminish their importance (e.g. "bk", "co"). Another solution is to use something like the `Partial` or `TokenSet` modifiers.
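
The standardization and abbreviation tips above could be sketched as a small preprocessing step, using only Julia's standard library. The `standardize` function and the `ABBREVIATIONS` table are assumptions made for illustration; they are not provided by the package, and the right set of rules depends on your data.

```julia
using Unicode  # standard library, provides Unicode.normalize

# Hypothetical abbreviation table; extend it to suit your data.
const ABBREVIATIONS = ("bank" => "bk", "company" => "co")

# Hypothetical helper: lowercase, strip accents, turn punctuation into
# spaces, collapse whitespace, then abbreviate common words.
function standardize(s::AbstractString)
    s = lowercase(s)
    s = Unicode.normalize(s, stripmark=true)       # remove diacritical marks
    s = replace(s, r"[^\p{L}\p{N}\s]" => " ")      # punctuation -> space
    s = replace(s, r"\s+" => " ")                  # collapse runs of whitespace
    for (word, abbr) in ABBREVIATIONS
        s = replace(s, Regex("\\b$word\\b") => abbr)
    end
    return String(strip(s))
end

standardize("  First  National Bank, Inc. ")   # "first national bk inc"
```

Apply the same function to both strings before passing them to a distance, so that differences in case, punctuation, or spacing do not count as edits.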