StringDistances

This Julia package computes various distances between strings.

Distances

Edit Distances

Q-Gram Distances

Q-gram distances compare the set of all substrings of length q in each string.
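To make the idea concrete, here is a minimal sketch in plain Julia that extracts q-grams and compares their counts. The helper names (qgrams, qgram_counts, qgram_distance) are hypothetical and written only for illustration; they are not the package's implementation, and the count-difference definition used here is one common choice among several q-gram distances.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Collect every substring of length q, in order of appearance.
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)  # index by character rather than by byte
    return [String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1)]
end

# Count how many times each q-gram occurs in s.
function qgram_counts(s::AbstractString, q::Integer)
    counts = Dict{String,Int}()
    for g in qgrams(s, q)
        counts[g] = get(counts, g, 0) + 1
    end
    return counts
end

# Sum of absolute differences between the two count profiles.
function qgram_distance(s1::AbstractString, s2::AbstractString, q::Integer)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    d = 0
    for g in union(keys(c1), keys(c2))
        d += abs(get(c1, g, 0) - get(c2, g, 0))
    end
    return d
end

qgrams("martha", 2)                    # ["ma", "ar", "rt", "th", "ha"]
qgram_distance("martha", "marhta", 2)  # 6
```

The two strings share only the bigrams "ma" and "ar", so three bigrams on each side are unmatched. Dividing the distance 6 by the total number of bigrams (10) and subtracting from 1 gives 0.4, which is consistent with the score that compare(QGram(2), "martha", "marhta") reports in the Syntax section.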

Others

Syntax

The function compare returns a similarity score between two strings, based on their distance. The score always lies between 0 and 1: a value of 0 means the strings are completely different, and a value of 1 means they are completely similar.

using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4

To return the literal distance between two strings, use evaluate.
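The difference between a raw distance and a similarity score can be sketched in plain Julia. The helpers below (hamming_distance, hamming_similarity) are hypothetical and only illustrate the relationship. They assume equal-length strings and a normalization by string length, which reproduces the Hamming score shown above but is not confirmed to be how the package computes it.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Raw distance: number of positions at which the two strings differ.
# Assumption: both strings have the same length.
function hamming_distance(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) ||
        throw(ArgumentError("strings must have the same length"))
    return count(a != b for (a, b) in zip(s1, s2))
end

# Similarity: the distance rescaled to [0, 1] by the string length.
# Assumption: non-empty strings, so the denominator is never zero.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    return 1 - hamming_distance(s1, s2) / max(length(s1), length(s2))
end

hamming_distance("martha", "marhta")    # 2
hamming_similarity("martha", "marhta")  # ≈ 0.6667
```

Assuming evaluate takes the same arguments as compare, a call such as evaluate(Hamming(), "martha", "marhta") would be expected to return the raw count rather than the normalized score.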

Modifiers

The package includes distance modifiers:

  • Winkler boosts the similarity score of strings with common prefixes.

    compare(Jaro(), "martha", "marhta")
    #> 0.9444444444444445
    compare(Winkler(Jaro()), "martha", "marhta")
    #> 0.9611111111111111
    

    The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.

    compare(QGram(2), "william", "williams")
    #> 0.9230769230769231
    compare(Winkler(QGram(2)), "william", "williams")
    #> 0.9538461538461539
    
  • The Python library fuzzywuzzy defines a few modifiers for the RatcliffObershelp similarity score. This package replicates them and extends them to any string distance:

    • Partial returns the maximal similarity score between the shorter string and substrings of the longer string.

      compare(Levenshtein(), "New York Yankees", "Yankees")
      #> 0.4375
      compare(Partial(Levenshtein()), "New York Yankees", "Yankees")
      #> 1.0
      
    • TokenSort adjusts for differences in word order by reordering words alphabetically.

      compare(RatcliffObershelp(), "mariners vs angels", "angels vs mariners")
      #> 0.44444
      compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
      #> 1.0
      
    • TokenSet adjusts for differences in word order and word count by comparing the words the two strings have in common with each string.

      compare(Jaro(), "mariners vs angels", "los angeles angels at seattle mariners")
      #> 0.559904
      compare(TokenSet(Jaro()), "mariners vs angels", "los angeles angels at seattle mariners")
      #> 0.944444
      
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths.

      compare(TokenMax(RatcliffObershelp()), "mariners vs angels", "los angeles angels at seattle mariners")
      #> 0.855
      
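The modifiers above can be approximated in a few lines of plain Julia. This is a minimal sketch, not the package's implementation: the function names (winkler, token_sort, partial, exact) are hypothetical, the Winkler constants (scaling factor 0.1, prefix capped at 4 characters) are the usual Jaro-Winkler values and are assumed here because they reproduce the scores shown above, and partial only considers windows of the shorter string's length, which is a simplification.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Winkler-style boost: move a similarity score toward 1 in proportion to the
# length of the common prefix (capped at maxlength characters).
function winkler(score, s1, s2; scaling = 0.1, maxlength = 4)
    prefix = 0
    for (a, b) in zip(s1, s2)
        (a == b && prefix < maxlength) || break
        prefix += 1
    end
    return score + prefix * scaling * (1 - score)
end

# TokenSort-style preprocessing: split on whitespace, sort the words, rejoin.
token_sort(s) = join(sort(split(s)), " ")

# Partial-style search: best score of the shorter string against every
# window of the same length in the longer string.
function partial(sim, s1, s2)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    a, b = collect(short), collect(long)
    n = length(a)
    return maximum(sim(String(a), String(b[i:i+n-1])) for i in 1:(length(b) - n + 1))
end

# Toy similarity used only to exercise partial: 1 for equal strings, else 0.
exact(x, y) = x == y ? 1.0 : 0.0

winkler(0.9444444444444445, "martha", "marhta")     # ≈ 0.9611 (prefix "mar")
winkler(0.9230769230769231, "william", "williams")  # ≈ 0.9538 (prefix capped at 4)
token_sort("mariners vs angels") == token_sort("angels vs mariners")  # true
partial(exact, "New York Yankees", "Yankees")       # 1.0
```

Each helper takes the base score or similarity function as an argument, which mirrors how the modifiers in this package wrap an arbitrary distance rather than being tied to a single one.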
## References
- [The stringdist Package for Approximate String Matching](https://journal.r-project.org/archive/2014-1/loo.pdf), Mark P.J. van der Loo
- [fuzzywuzzy blog post](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)