

Installation

The package is registered in the General registry, so it can be installed at the REPL with ] add StringDistances.

Supported Distances

Distances are defined for AbstractStrings and for any iterator that defines length() (e.g. graphemes, AbstractVector, ...).

The available distances are:

The package also defines Distance "modifiers" that can be applied to any distance.

  • Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
  • TokenSort adjusts for differences in word order by returning the distance between the two strings after re-ordering their words alphabetically.
  • TokenSet adjusts for differences in word order and word count by returning the distance between the intersection of the two strings and each string.
  • TokenMax normalizes the distance and combines the Partial, TokenSort and TokenSet modifiers, with penalty terms that depend on string lengths. It is a good choice for matching strings composed of multiple words, such as addresses. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
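The modifiers listed above wrap a base distance, and the wrapped object is then used wherever a distance is expected. The following is a minimal sketch, assuming the compare(s1, s2, dist) signature described later in this README; the example strings are illustrative, and no outputs are shown because exact scores may depend on the package version.

```julia
using StringDistances

s1, s2 = "mariners vs angels", "angels vs mariners"

# Plain Levenshtein similarity: penalized by the different word order.
compare(s1, s2, Levenshtein())

# TokenSort: words are sorted alphabetically before comparing.
compare(s1, s2, TokenSort(Levenshtein()))

# TokenSet: compares each string with the words the two strings share.
compare(s1, s2, TokenSet(Levenshtein()))

# Partial: compares the shorter string with substrings of the longer one.
compare("New York Yankees", "Yankees", Partial(Levenshtein()))

# TokenMax: combines the modifiers above.
compare(s1, s2, TokenMax(Levenshtein()))
```

Since both example strings contain the same words, the TokenSort score should be at least as high as the plain Levenshtein score.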

Basic Use

evaluate

You can always compute a certain distance between two strings using the following syntax:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
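As a small sketch of the two call forms, the snippet below checks that they agree and contrasts Levenshtein with DamerauLevenshtein on the same pair. The variable names are illustrative only.

```julia
using StringDistances

# Both call forms compute the same distance.
d1 = evaluate(Levenshtein(), "martha", "marhta")
d2 = Levenshtein()("martha", "marhta")
d1 == d2

# "th" and "ht" differ by a transposition of adjacent characters.
# DamerauLevenshtein treats a transposition as a single edit, so its
# distance here should be no larger than the Levenshtein distance.
d3 = evaluate(DamerauLevenshtein(), "martha", "marhta")
d3 <= d1
```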

pairwise

pairwise returns the matrix of distances between two AbstractVectors of AbstractStrings:

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

It is particularly fast for QGram-distances (each element is processed once).
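The sketch below shows how the result of pairwise can be inspected. It assumes that rows follow the first vector and columns follow the second; the names xs, ys and D are illustrative.

```julia
using StringDistances

xs = ["martha", "kitten"]
ys = ["marhta", "sitting"]

D = pairwise(Jaccard(3), xs, ys)

# One entry per pair of elements: a 2 x 2 matrix here.
size(D)

# Assuming D[i, j] holds the distance between xs[i] and ys[j], each entry
# should match the distance computed one pair at a time.
D[1, 1] ≈ evaluate(Jaccard(3), xs[1], ys[1])
```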

similarity scores

  • The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a Float64. A value of 0.0 means the strings are completely different, and a value of 1.0 means they are identical.

    Levenshtein()("martha", "martha")
    #> 0.0
    compare("martha", "martha", Levenshtein())
    #> 1.0
    
  • findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findnearest(s, itr, dist::StringDistance)
    
  • findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(s, itr, dist::StringDistance; min_score = 0.8)
    

The functions findnearest and findall are particularly optimized for the Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
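The three functions can be combined as in the following sketch, which uses only the signatures given above. The candidate list and variable names are illustrative.

```julia
using StringDistances

candidates = ["NewYork", "Newark", "San Francisco"]

# Similarity score between 0.0 and 1.0; higher means closer.
score = compare("New York", "NewYork", Levenshtein())

# Closest candidate, returned as a (value, index) pair.
value, index = findnearest("New York", candidates, Levenshtein())

# Indices of every candidate whose score exceeds the threshold.
matches = findall("New York", candidates, Levenshtein(); min_score = 0.8)
```

Raising min_score makes findall stricter, so fewer indices are returned.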

References