matthieugomez 99d77a585b comments 2015-11-05 10:46:48 -05:00

StringDistances

This Julia package computes various distances between strings.

Distances

Edit Distances

Q-Gram Distances

  • QGram Distance
  • Cosine Distance
  • Jaccard Distance

A good reference for q-gram distances is the article written for the R package stringdist: "The stringdist Package for Approximate String Matching" by Mark P.J. van der Loo.

Syntax

evaluate

The function evaluate returns the literal distance between two strings; a value of 0 means the strings are identical. While some distances are bounded by 1, other distances like Hamming, Levenshtein, Damerau-Levenshtein, and QGram can be higher than 1.

using StringDistances
evaluate(Hamming(), "martha", "marhta")
#> 2
evaluate(QGram(2), "martha", "marhta")
#> 6
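
To make the two results above concrete, here is a minimal sketch of both distances in plain Julia, without the package. The names hamming_sketch, qgram_counts and qgram_sketch are hypothetical and are not part of StringDistances; the equal-length requirement for Hamming is an assumption of this sketch, and the package's own implementation may differ.

```julia
# Hamming distance: count the positions at which two equal-length strings differ.
# (Requiring equal lengths is an assumption of this sketch.)
function hamming_sketch(s1::AbstractString, s2::AbstractString)
    length(s1) == length(s2) ||
        throw(ArgumentError("Hamming distance needs strings of equal length"))
    return count(a != b for (a, b) in zip(s1, s2))
end

# Count how often each q-gram (run of q consecutive characters) occurs in s.
function qgram_counts(s::AbstractString, q::Integer)
    chars = collect(s)               # index by character, not by byte
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - q + 1)
        gram = String(chars[i:i+q-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Q-gram distance: sum of absolute differences between the two count tables.
function qgram_sketch(s1::AbstractString, s2::AbstractString, q::Integer)
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    total = 0
    for gram in union(keys(c1), keys(c2))
        total += abs(get(c1, gram, 0) - get(c2, gram, 0))
    end
    return total
end

hamming_sketch("martha", "marhta")    # 2: positions 4 and 5 differ
qgram_sketch("martha", "marhta", 2)   # 6: rt, th, ha versus rh, ht, ta
```

The bigrams ma and ar occur once in each string and cancel out, which leaves the six unmatched bigrams counted by the distance.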

compare

The higher-level function compare computes, for any distance, a similarity score between 0 and 1. A value of 0 means the strings are completely different, and a value of 1 means they are identical.

using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4
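
One normalization that reproduces the two scores above is to divide the distance by the largest value it could take for the given inputs and subtract the result from 1. The sketch below assumes that rule; similarity_sketch is a hypothetical helper, and the rule the package applies to each distance may differ.

```julia
# Turn a distance into a similarity score in [0, 1], given the largest
# distance the two inputs could possibly have (assumed normalization).
function similarity_sketch(distance::Real, max_distance::Real)
    max_distance == 0 && return 1.0   # two empty inputs count as identical
    return 1.0 - distance / max_distance
end

# Hamming: at most all 6 positions can differ.
similarity_sketch(2, 6)    # about 0.667
# QGram(2): each string has 5 bigrams, so at most 5 + 5 = 10 can be unmatched.
similarity_sketch(6, 10)   # about 0.4
```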

Modifiers

The package defines a number of types to modify string metrics:

  • Winkler boosts the similarity score of strings with common prefixes

    compare(Jaro(), "martha", "marhta")
    #> 0.9444444444444445
    compare(Winkler(Jaro()), "martha", "marhta")
    #> 0.9611111111111111
    

    The Winkler adjustment was originally defined for the Jaro distance, but this package defines it for any string distance.

    compare(QGram(2), "william", "williams")
    #> 0.9230769230769231
    compare(Winkler(QGram(2)), "william", "williams")
    #> 0.9538461538461539
    
  • For strings composed of several words, the Python library fuzzywuzzy defines a few modifiers for the RatcliffObershelp distance. This package defines them for any string distance:

    • Partial adjusts for differences in string lengths. The function returns the maximal similarity score between the shorter string and all substrings of the longer string.

      compare(Partial(Hamming()), "New York Yankees", "Yankees")
      #> 1.0
      
    • TokenSort adjusts for differences in word order by reordering words alphabetically.

      compare(TokenSort(RatcliffObershelp()), "mariners vs angels", "angels vs mariners")
      #> 1.0
      
    • TokenSet adjusts for differences in word order and in the number of words.

      compare(TokenSet(RatcliffObershelp()), "mariners vs angels", "los angeles angels of anaheim at seattle mariners")
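
Because the modifiers apply to any distance, they can be pictured as functions that wrap an arbitrary similarity function. The following sketch is written under several assumptions and is not the package's implementation: all names are hypothetical, the Winkler scaling factor of 0.1 and prefix cap of 4 are the usual Jaro-Winkler values (they happen to reproduce the scores shown above), the partial variant only looks at windows of the longer string that have the same length as the shorter one, and hamming_similarity is a stand-in base score used for illustration.

```julia
# Stand-in base similarity (assumption): mismatching positions plus the
# length difference, scaled by the longer length.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    mismatches = count(a != b for (a, b) in zip(s1, s2)) +
                 abs(length(s1) - length(s2))
    return 1.0 - mismatches / n
end

# Winkler-style boost: move the score towards 1 in proportion to the
# length of the common prefix (capped at maxprefix characters).
function winkler_sketch(sim, s1, s2; scaling = 0.1, maxprefix = 4)
    score = sim(s1, s2)
    l = 0
    for (a, b) in zip(s1, s2)
        (a == b && l < maxprefix) || break
        l += 1
    end
    return score + l * scaling * (1.0 - score)
end

# Partial: best score of the shorter string against every same-length
# window of the longer string.
function partial_sketch(sim, s1, s2)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    lc = collect(long)               # index by character, not by byte
    n = length(short)
    best = 0.0
    for i in 1:(length(lc) - n + 1)
        best = max(best, sim(short, String(lc[i:i+n-1])))
    end
    return best
end

# TokenSort: sort the whitespace-separated words before comparing.
token_sort_sketch(sim, s1, s2) =
    sim(join(sort(split(s1)), " "), join(sort(split(s2)), " "))

partial_sketch(hamming_similarity, "New York Yankees", "Yankees")                      # 1.0
token_sort_sketch(hamming_similarity, "mariners vs angels", "angels vs mariners")     # 1.0
```

Passing the similarity as a function argument mirrors how Winkler(Jaro()) or Partial(Hamming()) wrap a distance in the examples above: the modifier does not need to know which distance it is adjusting.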