

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

The available distances are:

  • Edit distances, such as Levenshtein and DamerauLevenshtein.
  • Q-gram distances compare the set of all substrings of length q in each string.
  • Distance "modifiers" that can be applied to any distance:
    • Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
    • Partial returns the minimum distance between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word order by reordering words alphabetically.
    • TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
    • TokenMax combines the scores of the base distance and of its Partial, TokenSort and TokenSet modifiers, with penalty terms that depend on string lengths. This is a good distance for matching strings composed of multiple words, such as addresses. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
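To make two of the ideas in this list concrete, here is a minimal sketch in plain Julia (Base only) of q-gram sets and of the word reordering that TokenSort relies on. The names `qgram_set`, `jaccard_sketch` and `token_sort` are hypothetical helpers written for this illustration; they are not part of the package API, and the package may implement these steps differently.

```julia
# All substrings of length q in `s`, collected as a set (the q-grams of `s`).
# `collect` turns the string into a Vector{Char}, so indexing is safe for non-ASCII text.
function qgram_set(s::AbstractString, q::Integer)
    chars = collect(s)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# One possible q-gram distance: Jaccard distance between the two q-gram sets,
# 1 - |A ∩ B| / |A ∪ B|. Two inputs without any q-gram are treated as identical.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# Word reordering: split on whitespace, sort the words, join them back.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

jaccard_sketch("martha", "marhta", 2)  # 0.75: 2 shared bigrams out of 8 distinct ones
token_sort("new york mets")            # "mets new york"
```

After this reordering, "new york mets" and "mets new york" become the same string, so any base distance applied to the reordered strings no longer penalizes the different word order.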

Basic Use


You can compute any distance between two strings with either of the following equivalent forms:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")

You can also compute a distance between two iterators:

evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5])
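To show why the same distance applies to strings and to other sequences, here is a minimal Base-only sketch of the classic dynamic-programming Levenshtein distance. The function `levenshtein_sketch` is a hypothetical name used only for this illustration; it is a textbook formulation, not the package's optimized implementation.

```julia
# Edit distance between two iterables whose elements can be compared with `==`.
# Only two rows of the dynamic-programming table are kept in memory.
function levenshtein_sketch(xs, ys)
    a, b = collect(xs), collect(ys)
    prev = collect(0:length(b))          # distances from the empty prefix of `a`
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # distance from a[1:i] to the empty prefix of `b`
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

levenshtein_sketch("martha", "marhta")    # 2
levenshtein_sketch([1, 5, 6], [1, 6, 5])  # 2
```

Because the sketch only relies on `collect` and `==`, it treats a string as a sequence of characters and an array as a sequence of elements in exactly the same way.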


The function compare returns 1 minus the normalized distance between two strings. It always returns a Float64 between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are identical.

evaluate(Levenshtein(), "martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
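The relation between a raw distance and a similarity score can be sketched as follows. This assumes that an edit distance is normalized by the length of the longer input, which is a common convention for Levenshtein; the normalization used for other distances may differ. The helper `similarity_sketch` is hypothetical and not part of the package.

```julia
# Turn a raw edit distance into a similarity score in [0, 1].
# Assumption: the distance is normalized by the length of the longer input.
function similarity_sketch(distance::Integer, len1::Integer, len2::Integer)
    longest = max(len1, len2)
    longest == 0 && return 1.0      # two empty inputs are treated as identical
    return 1.0 - distance / longest
end

similarity_sketch(0, 6, 6)  # "martha" vs "martha": 1.0
similarity_sketch(2, 6, 6)  # "martha" vs "marhta": about 0.667
```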


  • findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findmax(s, itr, dist::StringDistance; min_score = 0.0)
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(s, itr, dist::StringDistance; min_score = 0.8)

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).
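The behaviour of these two searches can be sketched in plain Julia with a scoring function passed as an argument. Everything here is hypothetical and only illustrates the semantics described above: `best_match`, `all_matches` and the toy score `char_overlap` are not part of the package, and treating the threshold as inclusive and returning `(nothing, nothing)` when nothing qualifies are assumptions.

```julia
# Toy similarity in [0, 1]: Jaccard overlap of the sets of characters.
function char_overlap(a, b)
    sa, sb = Set(a), Set(b)
    isempty(sa) && isempty(sb) && return 1.0
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Element of `itr` with the highest score and its index, or (nothing, nothing)
# when no element reaches `min_score`. `itr` is assumed to be a Vector.
function best_match(s, itr, score; min_score = 0.0)
    best_i, best_score = nothing, -Inf
    for (i, x) in enumerate(itr)
        sc = score(s, x)
        if sc >= min_score && sc > best_score
            best_i, best_score = i, sc
        end
    end
    return best_i === nothing ? (nothing, nothing) : (itr[best_i], best_i)
end

# Indices of every element whose score is at least `min_score`.
all_matches(s, itr, score; min_score = 0.8) =
    [i for (i, x) in enumerate(itr) if score(s, x) >= min_score]

candidates = ["martha", "marta", "zzz"]
best_match("martha", candidates, char_overlap)                    # ("martha", 1)
all_matches("martha", candidates, char_overlap; min_score = 0.5)  # [1, 2]
```

This sketch scores every candidate in full. A search that only needs the matches above a threshold can stop working on a candidate as soon as its score can no longer reach that threshold, which is the kind of shortcut that makes specialized implementations faster.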