StringDistances

StringDistances computes various distances between strings. The package should work with any AbstractString (in particular ASCII and UTF-8 strings).

Distances

  • Hamming Distance
  • Jaro Distance
  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • QGram Distance
  • Cosine Distance
  • Jaccard Distance

A good reference on string distances is the article written for the R package stringdist: Mark P.J. van der Loo, "The stringdist Package for Approximate String Matching".

Syntax

  • The basic syntax follows the Distances package:

    using StringDistances
    evaluate(Hamming(), "martha", "marhta")
    evaluate(QGram(2), "martha", "marhta")
    
  • Normalize a distance to a value between 0 and 1 with Normalized

    evaluate(Normalized(Hamming()), "martha", "marhta")
    evaluate(Normalized(QGram(2)), "martha", "marhta")
    
  • Add a Winkler adjustment with Winkler

    evaluate(Winkler(Jaro()), "martha", "marhta")
    evaluate(Winkler(QGram(2)), "martha", "marhta")
    

    While the Winkler adjustment was originally defined in the context of the Jaro distance, it can be helpful with other distances too. Note: a distance is automatically normalized between 0 and 1 when used with a Winkler adjustment.
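
To make the Normalized and Winkler modifiers more concrete, here is a minimal pure-Julia sketch of what they compute in the Hamming case. It does not use the package: the function names `hamming_sketch`, `normalized_hamming_sketch` and `winkler_sketch` are hypothetical, and the normalization (dividing by the string length) and the Winkler parameters (scaling factor 0.1, prefix capped at 4 characters, no boost threshold) are the textbook choices, assumed here rather than taken from the package source.

```julia
# Illustrative helpers only; none of these names belong to StringDistances.

# Hamming distance: number of positions where two equal-length strings differ.
function hamming_sketch(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) || throw(ArgumentError("strings must have equal length"))
    d = 0
    for (c1, c2) in zip(s1, s2)
        d += (c1 != c2)
    end
    return d
end

# Assumed normalization: divide by the string length so the result lies in [0, 1].
function normalized_hamming_sketch(s1::AbstractString, s2::AbstractString)
    n = length(s1)
    return n == 0 ? 0.0 : hamming_sketch(s1, s2) / n
end

# Textbook Winkler adjustment applied to a normalized distance d in [0, 1]:
# the distance shrinks by a factor (1 - l * p), where l is the length of the
# common prefix (capped at maxprefix) and p is the scaling factor.
function winkler_sketch(d::Real, s1::AbstractString, s2::AbstractString;
                        p::Real = 0.1, maxprefix::Integer = 4)
    l = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && l < maxprefix) || break
        l += 1
    end
    return d * (1 - l * p)
end

hamming_sketch("martha", "marhta")                # 2 ("th" vs "ht")
d = normalized_hamming_sketch("martha", "marhta") # 2 / 6
winkler_sketch(d, "martha", "marhta")             # reduced, common prefix "mar"
```

Because the adjustment multiplies the distance by a factor that depends only on the shared prefix, it applies to any distance already scaled to the range 0 to 1, which is consistent with the note above that Winkler normalizes the wrapped distance first.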