

This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The score is always between 0 and 1, where 0 means completely different and 1 means completely similar. Its syntax is:

compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
  • Edit Distances

  • Q-gram distances compare the set of all substrings of length q in each string.

  • The package includes distance "modifiers" that can be applied to any distance.

    • Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
    • Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word order by reordering words alphabetically.
    • TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
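
To make the q-gram idea concrete, here is a minimal sketch in plain Julia that does not use the package. The helper names qgrams, qgram_counts and qgram_distance are hypothetical, and the distance shown (the sum of absolute differences in q-gram counts) is only one common variant, assumed here for illustration; the package's own implementation may differ.

```julia
# Hypothetical helper: all substrings of length q, taken character by character.
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return [String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)]
end

# Hypothetical helper: how many times each q-gram occurs in s.
function qgram_counts(s::AbstractString, q::Integer)
    counts = Dict{String,Int}()
    for g in qgrams(s, q)
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# Assumed variant: sum of absolute differences between the two count tables.
function qgram_distance(s1::AbstractString, s2::AbstractString, q::Integer)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    d = 0
    for g in union(keys(c1), keys(c2))
        d += abs(get(c1, g, 0) - get(c2, g, 0))
    end
    return d
end

qgrams("martha", 2)                    # ["ma", "ar", "rt", "th", "ha"]
qgram_distance("martha", "marhta", 2)  # 6: the two strings share only "ma" and "ar"
```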

Some examples:

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("martha", "marhta", QGram(2))
compare("martha", "marhta", Winkler(QGram(2)))
compare("martha", "marhta", Levenshtein())
compare("martha", "marhta", Partial(Levenshtein()))
compare("martha", "marhta", Jaro())
compare("martha", "marhta", TokenSet(Jaro()))
compare("martha", "marhta", TokenMax(RatcliffObershelp()))

A good distance to match strings composed of multiple words (like addresses) is TokenMax(Levenshtein()) (see fuzzywuzzy).
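
The TokenSort and Partial modifiers described above can be illustrated with a rough sketch in plain Julia. The names token_sort, partial_score and exact are hypothetical, the sliding window of the shorter string's length is an assumption about how "substrings of the longer string" are chosen, and the toy similarity exact stands in for a real distance.

```julia
# Hypothetical helper: reorder whitespace-separated words alphabetically.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# Hypothetical helper: best score of `sim` between the shorter string and
# every window of the longer string that has the shorter string's length.
function partial_score(sim, s1::AbstractString, s2::AbstractString)
    shorter, longer = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    b = collect(longer)
    n = length(shorter)
    best = 0.0
    for i in 1:(length(b) - n + 1)
        window = String(b[i:i+n-1])
        best = max(best, sim(shorter, window))
    end
    return best
end

# Toy similarity used only for this demo: 1.0 on exact match, 0.0 otherwise.
exact(x, y) = x == y ? 1.0 : 0.0

token_sort("mariners vs angels") == token_sort("angels vs mariners")  # true
exact("Yankees", "New York Yankees")                                  # 0.0
partial_score(exact, "Yankees", "New York Yankees")                   # 1.0
```

With word order normalized and the best-matching window selected, a base distance is no longer penalized for reordered words or for extra words around the match.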

Find

  • findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findmax(s::AbstractString, itr, dist::StringDistance; min_score = 0.0)
    
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(s::AbstractString, itr, dist::StringDistance; min_score = 0.8)
    

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
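
The behaviour of these two functions can be sketched in plain Julia as a linear scan. The names best_match, all_matches and prefix_sim are hypothetical and deliberately differ from the package's functions; returning (nothing, nothing) when no element reaches min_score and treating the threshold as inclusive are both assumptions. The sketch also omits the optimizations mentioned above.

```julia
# Toy similarity for the demo: length of the common prefix over the longer length.
function prefix_sim(a::AbstractString, b::AbstractString)
    n = 0
    for (x, y) in zip(a, b)
        x == y || break
        n += 1
    end
    len = max(length(a), length(b))
    return len == 0 ? 1.0 : n / len
end

# Hypothetical analogue of findmax: element and index with the highest score.
function best_match(sim, s, itr; min_score = 0.0)
    best_value, best_index, best_score = nothing, nothing, -Inf
    for (i, x) in enumerate(itr)
        score = sim(s, x)
        if score >= min_score && score > best_score
            best_value, best_index, best_score = x, i, score
        end
    end
    return best_value, best_index
end

# Hypothetical analogue of findall: indices of all elements reaching min_score.
all_matches(sim, s, itr; min_score = 0.8) =
    [i for (i, x) in enumerate(itr) if sim(s, x) >= min_score]

cities = ["Newark", "New York City", "Boston"]
best_match(prefix_sim, "New York", cities)                    # ("New York City", 2)
all_matches(prefix_sim, "New York", cities; min_score = 0.3)  # [1, 2]
```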

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 being completely similar. Some distances are between 0 and 1, while others are unbounded.

compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
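
To illustrate how an unbounded distance can be turned into a score between 0 and 1, here is a self-contained sketch in plain Julia. The names lev_distance and lev_similarity are hypothetical, and dividing by the length of the longer string is an assumed normalization, not necessarily the one the package uses.

```julia
# Hypothetical helper: textbook dynamic-programming Levenshtein distance.
function lev_distance(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))        # distances from the empty prefix of a
    for i in 1:length(a)
        cur = similar(prev)
        cur[1] = i                     # delete the first i characters of a
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,  # deletion
                             cur[j] + 1,       # insertion
                             prev[j] + cost)   # substitution or match
        end
        prev = cur
    end
    return prev[end]
end

# Assumed normalization: 1 - distance / length of the longer string.
function lev_similarity(s1::AbstractString, s2::AbstractString)
    len = max(length(s1), length(s2))
    len == 0 && return 1.0
    return 1.0 - lev_distance(s1, s2) / len
end

lev_distance("New York", "New York")    # 0
lev_similarity("New York", "New York")  # 1.0
lev_distance("martha", "marhta")        # 2
```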

References