matthieugomez 82d5f3bc91 remove Hamming, create StringDistance 2019-12-12 15:11:32 -05:00

README.md


This Julia package computes various distances between AbstractStrings.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Compare

The function compare returns a similarity score between two strings. The score is always between 0 and 1, where 0 means the strings are completely different and 1 means they are identical. Its syntax is:

compare(::AbstractString, ::AbstractString, ::StringDistance)
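
The third argument is a distance, optionally wrapped in a modifier:

  • Edit distances, such as Levenshtein and DamerauLevenshtein.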

  • Q-gram distances compare the set of all substrings of length q in each string.

  • The package includes distance "modifiers" that can be applied to any distance.

    • Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but this package defines it for any string distance.
    • Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word order by reordering words alphabetically.
    • TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.
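
To make two of the ideas in the list above concrete, here is a minimal sketch in plain Julia of what "the set of all substrings of length q" and "reordering words alphabetically" mean. The helpers qgram_set and token_sort are hypothetical names written for this illustration; they are not the package's internals, and the package may count repeated q-grams rather than use a plain set.

```julia
# Hypothetical helpers for illustration only; not part of StringDistances.

# Collect the set of all substrings of length q (q-grams) of s.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)  # index by character rather than by byte
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Reorder the whitespace-separated words of s alphabetically,
# which is the preprocessing idea behind TokenSort.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

qgram_set("william", 2)                  # the six 2-grams wi, il, ll, li, ia, am
token_sort("seattle mariners vs angels") # "angels mariners seattle vs"
```

After this kind of preprocessing, two strings that differ only in word order become identical, so any base distance applied to them reports no difference.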

Some examples:

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("william", "williams", QGram(2))
compare("william", "williams", Winkler(QGram(2)))
compare("New York Yankees", "Yankees", Levenshtein())
compare("New York Yankees", "Yankees", Partial(Levenshtein()))
compare("mariners vs angels", "los angeles angels at seattle mariners", Jaro())
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenSet(Jaro()))
compare("mariners vs angels", "los angeles angels at seattle mariners", TokenMax(RatcliffObershelp()))

A good distance for linking addresses and similar text (where word order does not matter) is TokenMax(Levenshtein()).

Find

  • findmax returns the value and index of the element in iter with the highest similarity score with x. Its syntax is:

    findmax(x::AbstractString, iter::AbstractVector, dist::StringDistance)
    
  • findall returns the indices of all elements in iter whose similarity score with x is higher than a minimum value (0.8 by default). Its syntax is:

    findall(x::AbstractString, iter::AbstractVector, dist::StringDistance)
    

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
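
The semantics described above can be sketched in plain Julia without the package. Everything named here (toy_score, best_match, all_matches, the min_score keyword and the candidates vector) is an assumption made for illustration, and the toy similarity stands in for a real string distance. The package's own methods are optimized and may differ in detail, for example in whether the threshold comparison is strict.

```julia
# Toy similarity, assumed for this sketch: Jaccard index of the character sets.
function toy_score(a::AbstractString, b::AbstractString)
    sa, sb = Set(a), Set(b)
    isempty(sa) && isempty(sb) && return 1.0
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Value and index of the element of iter with the highest score against x.
best_match(x, iter, score) = findmax([score(x, y) for y in iter])

# Indices of all elements of iter whose score against x exceeds min_score.
all_matches(x, iter, score; min_score = 0.8) =
    findall(y -> score(x, y) > min_score, iter)

candidates = ["New York", "Newark", "Boston"]  # hypothetical data
best_match("New York", candidates, toy_score)  # (1.0, 1)
all_matches("New York", candidates, toy_score) # [1]
```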

Evaluate

The function compare returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function evaluate returns the literal distance between two strings, with a value of 0 meaning identical. Some distances are between 0 and 1, while others are unbounded.

compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
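
To illustrate how an unbounded distance can be turned into a score between 0 and 1, here is a self-contained sketch using a textbook Levenshtein implementation. The function names levenshtein and similarity are hypothetical, and dividing by the length of the longer string is one common normalization assumed here; it is not stated in this README to be the package's exact formula.

```julia
# Textbook Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))  # distances from the empty prefix of s
    for i in 1:length(s)
        cur = similar(prev)
        cur[1] = i
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1, cur[j] + 1, prev[j] + cost)
        end
        prev = cur
    end
    return prev[end]
end

# Assumed normalization: 1 minus the distance over the longer length.
function similarity(a::AbstractString, b::AbstractString)
    n = max(length(a), length(b))
    return n == 0 ? 1.0 : 1.0 - levenshtein(a, b) / n
end

levenshtein("New York", "New York")  # 0
similarity("New York", "New York")   # 1.0
levenshtein("martha", "marhta")      # 2
```

Under this normalization, identical strings have distance 0 and similarity 1.0, which matches the two outputs shown above.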

References