Go to file
matthieugomez 49b1f3b439 slower but simpler iteration 2020-02-18 08:18:45 -05:00
.github/workflows Install TagBot as a GitHub Action 2020-02-09 14:10:27 -05:00
src slower but simpler iteration 2020-02-18 08:18:45 -05:00
test slower but simpler iteration 2020-02-18 08:18:45 -05:00
.gitignore slower but simpler iteration 2020-02-18 08:18:45 -05:00
.travis.yml Update .travis.yml 2019-12-13 11:16:22 -05:00
LICENSE.md first commit 2015-10-22 12:12:44 -04:00
Project.toml simplify three arguments form 2020-02-13 09:44:27 -05:00
README.md Update README.md 2020-02-12 21:41:55 -05:00

README.md

Build Status Coverage Status

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Evaluate

The function evaluate returns the distance between two strings. Its syntax is:

evaluate(dist, s1, s2)

where s1 and s2 can be any iterator with a length method (e.g. AbstractString, GraphemeIterator, AbstractVector...), and dist is one of the following distances:

  • Edit Distances

  • Q-gram distances compare the set of all substrings of length q in each string.

  • The package includes distance "modifiers", that can be applied to any distance.

    • Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
    • Partial returns the minimum distance between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word orders by reording words alphabetically.
    • TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.

Some examples:

evaluate(Jaro(), "martha", "marhta")
evaluate(Winkler(Jaro()), "martha", "marhta")
evaluate(QGram(2), "martha", "marhta")
evaluate(Winkler(QGram(2)), "martha", "marhta")
evaluate(Levenshtein(), "martha", "marhta")
evaluate(Partial(Levenshtein()), "martha", "marhta")
evaluate(Jaro(), "martha", "marhta")
evaluate(TokenSet(Jaro()), "martha", "marhta")
evaluate(TokenMax(RatcliffObershelp()), "martha", "marhta")

Alternatively, each distance can be used as a callable to call the evaluate function of each metric or modified metric, for example:

Jaro()("martha", "marhta")
Winkler(Jaro())("martha", "marhta")
QGram(2)("martha", "marhta")

A good distance to match strings composed of multiple words (like addresses) is TokenMax(Levenshtein()) (see fuzzywuzzy).

Compare

The function compare is defined as 1 minus the normalized distance between two strings. It always returns a number between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar.

evaluate(Levenshtein(), "New York", "New York")
#> 0
compare("New York", "New York", Levenshtein())
#> 1.0

Find

  • findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findmax(s, itr, dist::StringDistance; min_score = 0.0)
    
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum value (default to 0.8). Its syntax is:

    findall(s, itr, dist::StringDistance; min_score = 0.8)
    

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).

References