Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

The available distances are:

  • Edit Distances
  • Q-gram distances compare the set of all substrings of length q in each string.
  • Distance "modifiers" that can be applied to any distance:
    • Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
    • Partial returns the minimum distance between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word order by reordering words alphabetically.
    • TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths. This distance is well suited to matching strings composed of multiple words, like addresses. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
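
To make two of the ideas above concrete, here is a minimal sketch in plain Julia that does not use the package. The names qgrams, jaccard_qgram and token_sort are hypothetical helpers introduced only for illustration, and Jaccard is used as one possible way to compare q-gram sets; the package's own implementations may differ.

```julia
# Hypothetical helpers for illustration only; not the package's implementation.

# Collect every substring of length q as a set (assumes q >= 1).
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return Set(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance between the two q-gram sets: 1 - |A ∩ B| / |A ∪ B|.
function jaccard_qgram(s1, s2, q)
    a, b = qgrams(s1, q), qgrams(s2, q)
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# TokenSort-style preprocessing: split on whitespace, sort the words, rejoin.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

println(jaccard_qgram("martha", "marhta", 2))  # 0.75: 2 shared bigrams out of 8 distinct
println(token_sort("new york mets"))           # mets new york
```

After token_sort, "new york mets" and "mets new york" become the same string, so any base distance applied afterwards no longer sees the difference in word order.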

Basic Use

Evaluate

To compute a distance between two strings, use either of the following equivalent forms:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")

You can also compute a distance between two iterators:

evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5])
#> 2
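
To see where the value 2 comes from, the following sketch implements the textbook Levenshtein dynamic program in plain Julia. The name lev is hypothetical and the package's implementation is more optimized; the sketch only assumes that both inputs can be collected into arrays of comparable elements, which is why it works on strings and on vectors alike.

```julia
# Illustrative Levenshtein distance using two rolling rows of the DP table.
# `lev` is a hypothetical name; this is not the package's implementation.
function lev(a, b)
    x, y = collect(a), collect(b)
    prev = collect(0:length(y))          # distances from the empty prefix of x
    for i in 1:length(x)
        cur = similar(prev)
        cur[1] = i                       # delete the first i elements of x
        for j in 1:length(y)
            cost = x[i] == y[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                             cur[j] + 1,        # insertion
                             prev[j] + cost)    # substitution or match
        end
        prev = cur
    end
    return prev[end]
end

println(lev("martha", "marhta"))     # 2: the swapped "th" costs two substitutions
println(lev([1, 5, 6], [1, 6, 5]))   # 2
```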

Compare

The function compare is defined as 1 minus the normalized distance between two strings. It always returns a Float64 between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are identical.

evaluate(Levenshtein(), "martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
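
As a rough illustration of "1 minus the normalized distance", the sketch below divides a Levenshtein distance by the length of the longer input. Both the helper names (lev, similarity) and the choice of the longer length as the normalizer are assumptions made for this example; the package may normalize each distance differently.

```julia
# Hypothetical sketch: similarity = 1 - distance / (length of the longer input).
# Assumes inputs support length and collect, such as strings or vectors.

# Plain Levenshtein distance (illustrative, not the package's implementation).
function lev(a, b)
    x, y = collect(a), collect(b)
    prev = collect(0:length(y))
    for i in 1:length(x)
        cur = similar(prev)
        cur[1] = i
        for j in 1:length(y)
            cur[j + 1] = min(prev[j + 1] + 1, cur[j] + 1,
                             prev[j] + (x[i] == y[j] ? 0 : 1))
        end
        prev = cur
    end
    return prev[end]
end

# Two empty inputs are treated as identical to avoid dividing by zero.
function similarity(a, b)
    n = max(length(a), length(b))
    return n == 0 ? 1.0 : 1.0 - lev(a, b) / n
end

println(similarity("martha", "martha"))                    # 1.0
println(round(similarity("martha", "marhta"); digits = 2)) # 0.67
println(similarity("abc", "xyz"))                          # 0.0
```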

Find

  • findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findmax(s, itr, dist::StringDistance; min_score = 0.0)
    
  • findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(s, itr, dist::StringDistance; min_score = 0.8)
    

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
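
The behavior described for the two functions can be sketched as a linear scan in plain Julia. The names best_match, all_matches and toy_score are hypothetical, the scoring function is a deliberately simple stand-in for a real string similarity, and two choices here are assumptions rather than documented behavior: the threshold is treated as inclusive, and (nothing, nothing) is returned when no element reaches it. The package's optimized versions are not implemented this way.

```julia
# Hypothetical sketch of "best match" and "all matches" over an indexable collection.

# Toy similarity: share of positions at which the two strings agree.
toy_score(a, b) = count(p -> p[1] == p[2], zip(a, b)) / max(length(a), length(b), 1)

# Value and index of the element scoring highest against s,
# or (nothing, nothing) if no score reaches min_score (an assumption).
function best_match(s, itr, score; min_score = 0.0)
    best_idx, best = nothing, -Inf
    for (i, x) in pairs(itr)
        sc = score(s, x)
        if sc >= min_score && sc > best
            best_idx, best = i, sc
        end
    end
    return best_idx === nothing ? (nothing, nothing) : (itr[best_idx], best_idx)
end

# Indices of all elements whose score against s reaches min_score.
all_matches(s, itr, score; min_score = 0.8) =
    [i for (i, x) in pairs(itr) if score(s, x) >= min_score]

names = ["mew york", "new york", "newark"]
println(best_match("new york", names, toy_score))   # ("new york", 2)
println(all_matches("new york", names, toy_score))  # [1, 2]
```

A scan like this evaluates the full score for every element. One reason dedicated implementations can be faster is that a minimum score lets them stop computing a distance early once it is clear the threshold cannot be met.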

References