

This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The function always returns a score between 0 and 1, with a value of 0 being completely different and a value of 1 being completely similar. Its syntax is:

compare(::AbstractString, ::AbstractString, ::StringDistance)
  • Edit Distances

  • Q-gram distances compare the set of all substrings of length q in each string.

  • The package includes distance "modifiers" that can be applied to any distance.

    • Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
    • Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word order by reordering words alphabetically.
    • TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
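To make the q-gram idea above concrete, here is a minimal sketch in plain Julia that does not use the package. The names `qgrams` and `jaccard_similarity` are hypothetical helpers introduced only for illustration, and the Jaccard-style overlap is just one possible way to compare two sets of q-grams; it is not claimed to be how the package computes its scores.

```julia
# Hypothetical helper: collect all substrings of length q.
# Works on characters rather than bytes, so it is safe for Unicode strings.
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return [String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)]
end

# Hypothetical helper: share of q-grams that the two strings have in common
# (size of the intersection divided by size of the union).
function jaccard_similarity(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = Set(qgrams(s1, q)), Set(qgrams(s2, q))
    isempty(a) && isempty(b) && return 1.0  # neither string has any q-gram
    return length(intersect(a, b)) / length(union(a, b))
end

println(qgrams("william", 2))                          # ["wi", "il", "ll", "li", "ia", "am"]
println(jaccard_similarity("william", "williams", 2))  # 6 shared bigrams out of 7
```

The two strings share six of their seven distinct bigrams, so this illustrative score is 6/7.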

Some examples:

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))

When word order does not matter, a good choice is TokenMax(Levenshtein()).
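The reordering step that the TokenSort description refers to can be sketched in plain Julia. The helper `token_sort` is a hypothetical name used only for this illustration; it assumes words are separated by whitespace and is not the package's implementation.

```julia
# Hypothetical helper: split on whitespace, sort the words alphabetically,
# and join them back with single spaces.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

println(token_sort("seattle mariners vs angels"))  # angels mariners seattle vs

# Two strings that differ only in word order become identical after sorting,
# so any base distance applied afterwards sees no difference between them.
println(token_sort("mariners vs angels") == token_sort("angels vs mariners"))  # true
```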

Find

  • findmax returns the value and index of the element in itr with the highest similarity score with x. Its syntax is:

    findmax(x::AbstractString, itr, dist::StringDistance; min_score = 0.0)
    
  • findall returns the indices of all elements in itr whose similarity score with x is higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(x::AbstractString, itr, dist::StringDistance; min_score = 0.8)
    

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
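The behaviour described for the two functions can be approximated with a short sketch in plain Julia. Everything here is an assumption made for illustration: `best_match`, `all_matches` and the toy similarity `charsim` are hypothetical names, the handling of the case where no element reaches `min_score` is a guess, and none of the optimizations mentioned above are reproduced.

```julia
# Toy similarity in [0, 1]: overlap between the sets of characters.
# Stands in for any similarity function; assumes at least one string is non-empty.
charsim(a, b) = length(intersect(Set(a), Set(b))) / length(union(Set(a), Set(b)))

# Hypothetical analogue of findmax: score and index of the most similar element.
# Returns (nothing, nothing) when no element reaches min_score (an assumption).
function best_match(sim, x, itr; min_score = 0.0)
    best_score, best_index = -Inf, nothing
    for (i, y) in enumerate(itr)
        score = sim(x, y)
        if score >= min_score && score > best_score
            best_score, best_index = score, i
        end
    end
    return best_index === nothing ? (nothing, nothing) : (best_score, best_index)
end

# Hypothetical analogue of findall: indices of every element scoring at least min_score.
all_matches(sim, x, itr; min_score = 0.8) =
    [i for (i, y) in enumerate(itr) if sim(x, y) >= min_score]

candidates = ["xyz", "abd", "abc"]
println(best_match(charsim, "abc", candidates))                    # (1.0, 3)
println(all_matches(charsim, "abc", candidates; min_score = 0.5))  # [2, 3]
```

A linear scan like this evaluates the full score for every candidate, which is the baseline an optimized search would improve on.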

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means completely similar. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 meaning completely similar. Some distances are between 0 and 1, while others are unbounded.

compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
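The relationship between a raw distance and a similarity score can be illustrated with a self-contained sketch in plain Julia. The functions `levenshtein` and `similarity` below are hypothetical helpers written for this example, and dividing by the length of the longer string is one common normalization that is assumed here rather than taken from the package's source.

```julia
# Classic dynamic-programming Levenshtein distance, keeping two rows at a time.
function levenshtein(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of s1
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # deleting the first i characters of s1
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1  # substitution cost
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: map the unbounded distance to a score in [0, 1].
function similarity(s1::AbstractString, s2::AbstractString)
    len = max(length(s1), length(s2))
    return len == 0 ? 1.0 : 1.0 - levenshtein(s1, s2) / len
end

println(levenshtein("New York", "New York"))  # 0
println(similarity("New York", "New York"))   # 1.0
println(levenshtein("kitten", "sitting"))     # 3
```

Identical strings give a distance of 0 and a similarity of 1.0, matching the direction of the two scales described above.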

References