matthieugomez cf1d578bf6 rmv max_dist as internal field 2021-09-13 09:14:02 -04:00

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

The package defines two abstract types: StringSemiMetric <: SemiMetric and StringMetric <: Metric. Every string distance inherits from one of these two types. Distances act on any pair of iterators that define length (this includes AbstractStrings, but also GraphemeIterators and AbstractVectors).

The available distances are:

Basic Use

distance

To compute a distance between two strings, use either of the following forms:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")

In contrast, the function compare returns a similarity score, defined as 1 minus the normalized distance between the two strings. It always returns a Float64. A value of 0.0 means the strings are completely different and a value of 1.0 means they are identical.

compare("martha", "martha", Levenshtein())
#> 1.0
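
To make the relation between a distance and its similarity score concrete, here is a minimal, package-independent sketch. The names lev_sketch and similarity_sketch are hypothetical, and dividing by the length of the longer string is an assumed normalization; the package's own implementation may differ.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Classic dynamic-programming Levenshtein distance between two strings.
function lev_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))      # distances from the empty prefix of `a`
    for i in 1:length(a)
        cur = similar(prev)
        cur[1] = i                   # distance from a[1:i] to the empty string
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                             cur[j] + 1,        # insertion
                             prev[j] + cost)    # substitution
        end
        prev = cur
    end
    return prev[end]
end

# Similarity = 1 - distance / length of the longer string (assumed normalization).
function similarity_sketch(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    return 1.0 - lev_sketch(s1, s2) / n
end
```

Under these assumptions, lev_sketch("martha", "marhta") is 2 (two substitutions), so the similarity is 1 - 2/6, while identical strings give exactly 1.0.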

pairwise

pairwise returns the matrix of distances between two AbstractVectors of AbstractStrings (or iterators).

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

The function pairwise is particularly optimized for QGram-distances (each element is processed only once).
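
The idea of processing each element only once can be sketched without the package: compute the q-grams of every input a single time, then fill the matrix from the cached sets. The names qgram_set, jaccard_sketch and pairwise_sketch are hypothetical, and the set-based Jaccard formula used here is an assumption about the definition, not the package's code.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Set of all length-q substrings (q-grams) of a string.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)
    return Set{String}(String(chars[i:i + q - 1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B| (assumed definition).
function jaccard_sketch(A::Set, B::Set)
    isempty(A) && isempty(B) && return 0.0
    return 1.0 - length(intersect(A, B)) / length(union(A, B))
end

function pairwise_sketch(xs, ys, q::Int)
    # Pre-compute the q-gram sets so that each string is processed only once.
    qx = [qgram_set(x, q) for x in xs]
    qy = [qgram_set(y, q) for y in ys]
    # Entry (i, j) is the distance between xs[i] and ys[j].
    return [jaccard_sketch(a, b) for a in qx, b in qy]
end
```

For example, pairwise_sketch(["martha", "kitten"], ["marhta", "sitting"], 3) returns a 2×2 matrix whose (i, j) entry compares the i-th element of the first vector with the j-th element of the second.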

fuzzywuzzy

The package also defines distance "modifiers" modeled on those in the Python package fuzzywuzzy. These modifiers are particularly helpful for matching strings composed of multiple words (e.g. addresses, company names).

  • Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
  • TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
  • TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
  • TokenMax normalizes the distance and combines the Partial, TokenSort and TokenSet modifiers, with penalty terms that depend on the strings. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy.
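
As a rough illustration of the first two modifiers, the sketch below wraps an arbitrary distance function. All names (mismatch, token_sort_sketch, partial_sketch) are hypothetical, mismatch is only a crude stand-in for a real distance, and it is an assumption that Partial considers the windows of the longer string whose length equals that of the shorter string.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Crude stand-in distance: mismatched positions plus the length difference.
mismatch(a, b) = count(x != y for (x, y) in zip(a, b)) + abs(length(a) - length(b))

# TokenSort idea: sort the words of each string alphabetically, then compare.
function token_sort_sketch(dist, s1::AbstractString, s2::AbstractString)
    normalize_order(s) = join(sort(split(s)), " ")
    return dist(normalize_order(s1), normalize_order(s2))
end

# Partial idea: smallest distance between the shorter string and any
# same-length window of the longer string (assumed window choice).
function partial_sketch(dist, s1::AbstractString, s2::AbstractString)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    a, b = collect(short), collect(long)
    n = length(a)
    return minimum(dist(a, b[i:i + n - 1]) for i in 1:(length(b) - n + 1))
end
```

With these definitions, token_sort_sketch(mismatch, "new york mets", "mets new york") is 0 because both strings contain the same words, and partial_sketch(mismatch, "york", "new york mets") is 0 because "york" occurs inside the longer string.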

find

The package also adds convenience functions to find the elements of a list that are closest to a given string:

  • findnearest returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

    findnearest(s, itr, dist::StringDistance)
    
  • findall returns the indices of all elements in itr whose similarity score with s is higher than a minimum value (defaults to 0.8). Its syntax is:

    findall(s, itr, dist::StringDistance; min_score = 0.8)
    

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignement distances (these distances stop early if the distance is higher than a certain threshold).
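
The behaviour described above can be sketched with a plain linear scan over a user-supplied similarity function. The names toy_score, findnearest_sketch and findall_sketch are hypothetical, toy_score is only a placeholder similarity, and the early-stopping optimization mentioned above is deliberately left out.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Placeholder similarity in [0, 1]: share of positions holding the same character.
function toy_score(a, b)
    n = max(length(a), length(b))
    n == 0 && return 1.0
    return count(x == y for (x, y) in zip(a, b)) / n
end

# Return the value and index of the element with the highest score.
function findnearest_sketch(score, s, itr)
    best_val, best_idx, best_score = nothing, nothing, -Inf
    for (i, x) in enumerate(itr)
        sc = score(s, x)
        if sc > best_score
            best_val, best_idx, best_score = x, i, sc
        end
    end
    return best_val, best_idx
end

# Return the indices of all elements whose score is higher than min_score.
function findall_sketch(score, s, itr; min_score = 0.8)
    return [i for (i, x) in enumerate(itr) if score(s, x) > min_score]
end
```

For instance, with candidates ["marhta", "kitten", "martha!"] and the query "martha", findnearest_sketch(toy_score, "martha", candidates) returns ("martha!", 3), and findall_sketch with the default threshold returns [3].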

Notes

  • All string lookups are case sensitive.