add RatcliffObershelp

2015-11-04 12:40:30 -05:00 · 2015-11-04 12:40:30 -05:00 · aa4c75a340
parent 99b997c9e2
commit aa4c75a340
16 changed files with 408 additions and 141 deletions
--- a/README.md
+++ b/README.md
@@ -2,43 +2,95 @@
 [![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
 [![StringDistances](http://pkg.julialang.org/badges/StringDistances_0.4.svg)](http://pkg.julialang.org/?pkg=StringDistances)

-StringDistances allow to compute various distances between strings. The package should work with any `AbstractString` (in particular ASCII and UTF-8)
+This Julia package computes various distances between strings.
+


 ## Distances

+#### Edit Distances
 - Hamming Distance
 - Jaro Distance
 - Levenshtein Distance
 - Damerau-Levenshtein Distance
+- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) (similar to the Python library [difflib](https://docs.python.org/2/library/difflib.html))
+
+#### Q-Gram Distances
 - QGram Distance
 - Cosine Distance
 - Jaccard Distance

-
-A good reference about string distances is the article written for the R package `stringdist`:
+A good reference for q-gram distances is the article written for the R package `stringdist`:
 *The stringdist Package for Approximate String Matching* Mark P.J. van der Loo

+
 ## Syntax
- The basic syntax follows the [Distances](https://github.com/JuliaStats/Distances.jl) package:
+
+
+
+#### evaluate
+The function `evaluate` returns the literal distance between two strings; a value of 0 means the strings are identical. While some distances are bounded by 1, others, such as `Hamming`, `Levenshtein`, `DamerauLevenshtein`, and `QGram`, can be higher than 1.
+
+```julia
+using StringDistances
+evaluate(Hamming(), "martha", "marhta")
+#> 2
+evaluate(QGram(2), "martha", "marhta")
+#> 6
+```
+
+#### compare
+The higher-level function `compare` computes, for any distance, a similarity score between 0 and 1: a value of 0 means the strings are completely different and a value of 1 means they are a perfect match.
+```julia
+using StringDistances
+compare(Hamming(), "martha", "marhta")
+#> 0.6666666666666667
+compare(QGram(2), "martha", "marhta")
+#> 0.4
+```
+
+
+## Modifiers
+
+The package defines a number of types to modify string metrics:
+
+- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes

 	```julia
-	using StringDistances
-	evaluate(Hamming(), "martha", "marhta")
-	evaluate(QGram(2), "martha", "marhta")
+	compare(Jaro(), "martha", "marhta")
+	#> 0.9444444444444445
+	compare(Winkler(Jaro()), "martha", "marhta")
+	#> 0.9611111111111111
 	```
-
- Normalize a distance between 0-1 with `Normalized`
+	The Winkler adjustment was originally defined for the Jaro distance but this package defines it for any string distance.

 	```julia
-	evaluate(Normalized(Hamming()), "martha", "marhta")
-	evaluate(Normalized(QGram(2)), "martha", "marhta")
+	compare(QGram(2), "william", "williams")
+	#> 0.9230769230769231
+	compare(Winkler(QGram(2)), "william", "williams")
+	#> 0.9538461538461539
 	```

- Add a [Winkler adjustment](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) with `Winkler`
+- For strings composed of several words, the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` distance. This package defines them for any string distance:
+
+	- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in string length. The function returns the maximal similarity score between the shorter string and every substring of the longer string that has the same length as the shorter one.
+
+		```julia
+		compare(Partial(Hamming()), "New York Yankees", "Yankees")
+		#> 1.0
+		```
+
+	- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
+
+		```julia
+		compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
+		#> 1.0
+		```
+
+	- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count.
+
+		```julia
+		compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners")
+		```
+

-	```julia
-	evaluate(Winkler(Jaro()), "martha", "marhta")
-	evaluate(Winkler(Qgram(2)), "martha", "marhta")
-	```
-	While the Winkler adjustment was originally defined in the context of the Jaro distance, it can be helpful with other distances too. Note: a distance is automatically normalized between 0 and 1 when used with a Winkler adjustment.
--- a/3
+++ b/3
@@ -1,2 +1,3 @@
 julia 0.4
-Distances
+Distances
+Iterators
--- a/benchmark/benchmark.jl
+++ b/benchmark/benchmark.jl
@@ -19,6 +19,9 @@ end
 @time f(Float64, Cosine(2), x, y)
 @time f(Float64, Jaccard(2), x, y)

+#
+@time f(Float64, RatcliffObershelp(), x, y)
+



--- a/src/StringDistances.jl
+++ b/src/StringDistances.jl
@@ -9,33 +9,34 @@ module StringDistances
 ##############################################################################

 import Distances: evaluate, Hamming, hamming, PreMetric, SemiMetric
-export evaluate,
-Hamming, hamming,
-Levenshtein, levenshtein,
-DamerauLevenshtein, damerau_levenshtein,
-Jaro, jaro,
-QGram, qgram,
-Cosine, cosine,
-Jaccard, jaccard,
-Normalized,
-Winkler
+import Iterators: chain
+export
+evaluate,
+compare,
+Hamming,
+Levenshtein,
+DamerauLevenshtein,
+Jaro,
+QGram,
+Cosine,
+Jaccard,
+longest_common_substring,
+matching_blocks,
+RatcliffObershelp,
+Winkler,
+Partial,
+TokenSort,
+TokenSet

+include("distances/evaluate.jl")
+include("distances/edit.jl")
+include("distances/qgram.jl")
+include("distances/RatcliffObershelp.jl")

-# 1. only do the switch once
-# 2. precomputes length(s1), length(s2)
-function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString, x...)
-	len1, len2 = length(s1), length(s2)
-	if len1 > len2
-		return evaluate(dist, s2, s1, len2, len1, x...)
-	else
-		return evaluate(dist, s1, s2, len1, len2, x...)
-	end
-end
-
-include("edit.jl")
-include("qgram.jl")
-include("normalized.jl")
-include("winkler.jl")
+include("modifiers/compare.jl")
+include("modifiers/winkler.jl")
+include("modifiers/tokenize.jl")
+include("modifiers/partial.jl")


 end 
--- a/src/distances/RatcliffObershelp.jl
+++ b/src/distances/RatcliffObershelp.jl
@@ -0,0 +1,57 @@
+# Return a character index, not a byte index
+function longest_common_substring(s1::AbstractString, s2::AbstractString)
+    len2 = length(s2)
+    start1, start2, size = 0, 0, 0
+    p = zeros(Int, len2)
+    i1 = 0
+    for ch1 in s1
+        i1 += 1
+        i2 = 0
+        oldp = 0
+        for ch2 in s2
+            i2 += 1
+            newp = 0
+            if ch1 == ch2
+                newp = oldp > 0 ? oldp : i2
+                currentlength = (i2 - newp + 1)
+                if currentlength > size
+                    start1, start2, size = i1 - currentlength + 1, newp, currentlength
+                end
+            end
+            p[i2], oldp = newp, p[i2]
+        end
+    end
+    return start1, start2, size
+end
+
+function matching_blocks!(x::Set{Tuple{Int, Int, Int}}, s1::AbstractString, s2::AbstractString, start1::Integer, start2::Integer)
+    a = longest_common_substring(s1, s2)
+    if a[3] > 0
+        push!(x, (a[1] + start1 - 1, a[2] + start2 - 1, a[3]))
+        s1before = SubString(s1, start(s1), chr2ind(s1, a[1]) - 1)
+        s2before = SubString(s2, start(s2), chr2ind(s2, a[2]) - 1)
+        matching_blocks!(x, s1before, s2before, start1, start2)
+        if (a[1] + a[3]) <= length(s1) && (a[2] + a[3]) <= length(s2)
+            s1after = SubString(s1, chr2ind(s1, a[1] + a[3]), endof(s1))
+            s2after = SubString(s2, chr2ind(s2, a[2] + a[3]), endof(s2))
+            matching_blocks!(x, s1after, s2after, start1 + a[1] + a[3] - 1, start2 + a[2] + a[3] - 1)
+        end
+    end
+end
+
+function matching_blocks(s1::AbstractString, s2::AbstractString)
+    x = Set{Tuple{Int, Int, Int}}()
+    matching_blocks!(x, s1, s2, 1, 1)
+    return x
+end
+
+type RatcliffObershelp <: PreMetric end
+function evaluate(dist::RatcliffObershelp, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+    len2 == 0 && return 0.0
+    result = matching_blocks(s1, s2)
+    matched = 0
+    for x in result
+        matched += x[3]
+    end
+    1.0 - 2 * matched / (len1 + len2)
+end
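The computation in `RatcliffObershelp.jl` can be sketched independently of the package as follows. This is a minimal illustration, assuming Julia 1.x rather than the 0.4 syntax of the commit; the names `lcs`, `matched_chars` and `ro_similarity` are hypothetical and not part of the package. It works on character vectors to avoid mixing byte and character indices, and its tie-breaking between equally long substrings may differ from the committed code.

```julia
# Longest common substring of two character vectors.
# Returns (start in a, start in b, length); length 0 means nothing in common.
function lcs(a::Vector{Char}, b::Vector{Char})
    best = (0, 0, 0)
    # prev[j + 1] holds the common-suffix length ending at a[i - 1] and b[j]
    prev = zeros(Int, length(b) + 1)
    for i in eachindex(a)
        curr = zeros(Int, length(b) + 1)
        for j in eachindex(b)
            if a[i] == b[j]
                len = prev[j] + 1
                curr[j + 1] = len
                if len > best[3]
                    best = (i - len + 1, j - len + 1, len)
                end
            end
        end
        prev = curr
    end
    return best
end

# Number of characters covered by the recursively matched blocks:
# take the longest common substring, then recurse on both sides of it.
function matched_chars(a::Vector{Char}, b::Vector{Char})
    i, j, len = lcs(a, b)
    len == 0 && return 0
    left = matched_chars(a[1:i-1], b[1:j-1])
    right = matched_chars(a[i+len:end], b[j+len:end])
    return len + left + right
end

# Similarity 2 * matched / (|s1| + |s2|); two empty strings count as identical.
function ro_similarity(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    total = length(a) + length(b)
    total == 0 && return 1.0
    return 2 * matched_chars(a, b) / total
end

ro_similarity("dixon", "dicksonx")   # blocks "di" and "on": 2 * 4 / (5 + 8)
```

The distance returned by `evaluate` in the commit is one minus this similarity, which is why the tests are written as `1 - 0.6153846153846154`.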
--- a/src/distances/edit.jl
+++ b/src/distances/edit.jl
@@ -24,7 +24,7 @@ end
 ##
 ##############################################################################

-function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2:: Integer)
    count = 0
    for (ch1, ch2) in zip(s1, s2)
        count += ch1 != ch2
@@ -33,8 +33,6 @@ function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::I
    return count
 end

-hamming(s1::AbstractString, s2::AbstractString) = evaluate(Hamming(), s1, s2)
-
 ##############################################################################
 ##
 ## Levenshtein
@@ -83,9 +81,6 @@ function evaluate(dist::Levenshtein, s1::AbstractString, s2::AbstractString, len
    end
    return current
 end
-function levenshtein(s1::AbstractString, s2::AbstractString)
-    evaluate(Levenshtein(), s1, s2)
-end

 ##############################################################################
 ##
@@ -157,11 +152,9 @@ function evaluate(dist::DamerauLevenshtein, s1::AbstractString, s2::AbstractStri
    return current
 end

-damerau_levenshtein(s1::AbstractString, s2::AbstractString) = evaluate(DamerauLevenshtein(), s1, s2)
-
 ##############################################################################
 ##
-## JaroWinkler
+## Jaro
 ##
 ##############################################################################

@@ -208,7 +201,3 @@ function evaluate(dist::Jaro, s1::AbstractString, s2::AbstractString, len1::Inte
 end

 jaro(s1::AbstractString, s2::AbstractString) = evaluate(Jaro(), s1, s2)
-
-
-
-
--- a/src/distances/evaluate.jl
+++ b/src/distances/evaluate.jl
@@ -0,0 +1,8 @@
+function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString)
+    len1, len2 = length(s1), length(s2)
+    if len1 > len2
+        return evaluate(dist, s2, s1, len2, len1)
+    else
+        return evaluate(dist, s1, s2, len1, len2)
+    end
+end
--- a/src/distances/qgram.jl
+++ b/src/distances/qgram.jl
@@ -41,7 +41,7 @@ function Base.collect(qgram::QGramIterator)
 	end
 	return x
 end
-Base.sort(qgram::QGramIterator) = sort!(collect(qgram), alg = QuickSort)
+Base.sort(qgram::QGramIterator) = sort!(collect(qgram))

 ##############################################################################
 ##
@@ -94,13 +94,12 @@ end
 ##
 ##############################################################################

-type QGram{T <: Integer} <: AbstractQGram
+immutable QGram{T <: Integer} <: AbstractQGram
 	q::T
 end
 QGram() = QGram(2)

 function evaluate(dist::QGram, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
-	len2 == 0 && return 0
 	n = 0
 	for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
 		n += abs(n1 - n2)
@@ -119,14 +118,13 @@ end
 ## 1 - v(s1, p).v(s2, p)  / ||v(s1, p)|| * ||v(s2, p)||
 ##############################################################################

-type Cosine{T <: Integer} <: AbstractQGram
+immutable Cosine{T <: Integer} <: AbstractQGram
 	q::T
 end
 Cosine() = Cosine(2)

 function evaluate(dist::Cosine, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
-	len2 == 0 && return 0.0
-	(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
+	len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
 	norm1, norm2, prodnorm = 0, 0, 0
 	for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
 		norm1 += n1^2
@@ -147,18 +145,15 @@ end
 ## Denote Q(s, q) the set of tuple of length q in s
 ## 1 - |intersect(Q(s1, q), Q(s2, q))| / |union(Q(s1, q), Q(s2, q))|
 ##
-## return 1.0 if smaller than qgram
-##
 ##############################################################################

-type Jaccard{T <: Integer} <: AbstractQGram
+immutable Jaccard{T <: Integer} <: AbstractQGram
 	q::T
 end
 Jaccard() = Jaccard(2)

 function evaluate(dist::Jaccard, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
-	len2 == 0 && return 0.0
-	(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
+	len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
 	ndistinct1, ndistinct2, nintersect = 0, 0, 0
 	for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
 		ndistinct1 += n1 > 0
--- a/src/modifiers/compare.jl
+++ b/src/modifiers/compare.jl
@@ -0,0 +1,36 @@
+##############################################################################
+##
+## compare
+##
+##############################################################################
+
+function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString)
+    len1, len2 = length(s1), length(s2)
+    if len1 > len2
+        return compare(dist, s2, s1, len2, len1)
+    else
+        return compare(dist, s1, s2, len1, len2)
+    end
+end
+
+
+
+function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+    1.0 - evaluate(dist, s1, s2, len1, len2)
+end
+
+function compare(dist::Union{Hamming, Levenshtein, DamerauLevenshtein}, s1::AbstractString, s2::AbstractString,
+    len1::Integer, len2::Integer)
+    distance = evaluate(dist, s1, s2, len1, len2)
+    return len2 == 0 ? 1.0 : 1.0 - distance / len2
+end
+
+function compare(dist::QGram, s1::AbstractString, s2::AbstractString, 
+    len1::Integer, len2::Integer)
+    distance = evaluate(dist, s1, s2, len1, len2)
+    if len1 <= (dist.q - 1)
+        return s1 == s2 ? 1.0 : 0.0
+    else 
+        return 1 - distance / (len1 + len2 - 2 * dist.q + 2)
+    end
+end
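The normalizations in `compare.jl` can be checked by hand against the README examples. The sketch below assumes Julia 1.x and uses hypothetical helper names (`edit_similarity`, `qgram_similarity`) that take an already computed distance; it restates only the arithmetic, not the dispatch.

```julia
# Hypothetical helpers, not part of the package API.

# Hamming, Levenshtein and Damerau-Levenshtein distances are divided by
# len2, the length of the longer string; two empty strings score 1.0.
edit_similarity(distance, len2) = len2 == 0 ? 1.0 : 1.0 - distance / len2

# A string of length n has n - q + 1 q-grams, so the QGram distance between
# two strings is at most (len1 - q + 1) + (len2 - q + 1) = len1 + len2 - 2q + 2.
qgram_similarity(distance, len1, len2, q) = 1.0 - distance / (len1 + len2 - 2q + 2)

edit_similarity(2, 6)          # Hamming distance 2 for "martha" / "marhta": 1 - 2/6
qgram_similarity(6, 6, 6, 2)   # QGram(2) distance 6 for the same pair: 1 - 6/10
```

These reproduce, up to floating-point rounding, the `compare` values quoted in the README hunk of this commit.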
--- a/src/modifiers/partial.jl
+++ b/src/modifiers/partial.jl
@@ -0,0 +1,43 @@
+##############################################################################
+##
+## Partial
+## From the Python module fuzzywuzzy
+## http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+##
+##############################################################################
+type Partial{T <: PreMetric} <: PreMetric
+    dist::T
+end
+
+# general
+function compare(dist::Partial, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+    len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
+    len1 == 0 && return compare(dist.dist, "", "", 0, 0)
+    iter = QGramIterator(s2, len2, len1)
+    state = start(iter)
+    s, state = next(iter, state)
+    out = compare(dist.dist, s1, s)
+    while !done(iter, state)
+        s, state = next(iter, state)
+        curr = compare(dist.dist, s1, s)
+        out = max(out, curr)
+    end
+    return out
+end
+
+# Specialization for RatcliffObershelp distance
+# Code: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py
+function compare(dist::Partial{RatcliffObershelp}, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+    len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
+    out = 0.0
+    result = matching_blocks(s1, s2)
+    for r in result
+        s2_start = max(1, r[2] - r[1] + 1)
+        s2_end = s2_start + len1 - 1
+        i2_start =  chr2ind(s2, s2_start)
+        i2_end = s2_end == len2 ? endof(s2) : (chr2ind(s2, s2_end + 1) - 1)
+        curr = compare(RatcliffObershelp(), s1, SubString(s2, i2_start, i2_end), len1, len1)
+        out = max(out, curr)
+    end
+    return out
+end
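The general `Partial` method slides a window of the shorter string's length over the longer string and keeps the best score. A minimal sketch of that idea, assuming Julia 1.x; `partial_score` and `hamming_similarity` are hypothetical names, and the window is built from character vectors instead of the package's `QGramIterator`:

```julia
# Hypothetical similarity: share of positions at which two equal-length strings agree.
hamming_similarity(a, b) =
    isempty(a) ? 1.0 : count(((x, y),) -> x == y, zip(a, b)) / length(a)

# Best value of score(shorter, window) over every window of the longer
# string that has the same number of characters as the shorter string.
function partial_score(score, s1::AbstractString, s2::AbstractString)
    short, long = collect(s1), collect(s2)
    if length(short) > length(long)
        short, long = long, short
    end
    n = length(short)
    n == 0 && return score("", "")
    best = 0.0
    for i in 1:(length(long) - n + 1)
        window = String(long[i:i + n - 1])
        best = max(best, score(String(short), window))
    end
    return best
end

partial_score(hamming_similarity, "New York Yankees", "Yankees")
```

The last window of "New York Yankees" is exactly "Yankees", so the call returns a perfect score, in line with the `Partial(Hamming())` example in the README hunk.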
--- a/src/modifiers/tokenize.jl
+++ b/src/modifiers/tokenize.jl
@@ -0,0 +1,61 @@
+##############################################################################
+##
+## TokenSort
+##
+##############################################################################
+type TokenSort{T <: PreMetric} <: PreMetric
+    dist::T
+end
+
+function compare{T <: AbstractString}(dist::TokenSort, s1::T, s2::T, len1::Integer, len2::Integer)
+    s1 = join(sort!(split(s1)), " ")
+    s2 = join(sort!(split(s2)), " ")
+    compare(dist.dist, s1, s2)
+end
+
+##############################################################################
+##
+## TokenSet
+##
+##############################################################################
+type TokenSet{T <: PreMetric} <: PreMetric
+    dist::T
+end
+
+function compare{T <: AbstractString}(dist::TokenSet, s1::T, s2::T, len1::Integer, len2::Integer)
+    v0, v1, v2 = _separate!(split(s1), split(s2))
+    s0 = join(v0, " ")
+    s1 = join(chain(v0, v1), " ")
+    s2 = join(chain(v0, v2), " ")
+    if isempty(s0)
+        # otherwise compare(dist, "", "a")== 1.0 
+        compare(dist.dist, s1, s2)
+    else
+        max(compare(dist.dist, s0, s1), 
+            compare(dist.dist, s1, s2), 
+            compare(dist.dist, s0, s2))        
+    end
+end
+
+# separate 2 vectors into intersection, setdiff1, setdiff2 (all sorted)
+function _separate!(v1::Vector, v2::Vector)
+    sort!(v1)
+    sort!(v2)
+    out = eltype(v1)[]
+    start = 1
+    i1 = 0
+    while i1 < length(v1)
+        i1 += 1
+        x = v1[i1]
+        i2 = searchsortedfirst(v2, x, start, length(v2), Base.Forward)
+        i2 > length(v2) && break 
+        if i2 > 0 && v2[i2] == x
+            deleteat!(v1, i1)
+            deleteat!(v2, i2)
+            push!(out, x)
+            i1 -= 1
+            start = i2 
+        end
+    end
+    return out, v1, v2
+end
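The string preprocessing behind `TokenSort` and `TokenSet` can be illustrated on its own. This is a simplified sketch, assuming Julia 1.x; `token_sort` and `token_split` are hypothetical names. Unlike `_separate!` above, which pairs words one by one, the set-based version below discards repeated words.

```julia
# Sort whitespace-separated words so that word order no longer matters.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# Split two strings into (shared words, words only in s1, words only in s2),
# each sorted alphabetically. Repeated words are collapsed (simplification).
function token_split(s1::AbstractString, s2::AbstractString)
    w1, w2 = Set(split(s1)), Set(split(s2))
    common = sort(collect(intersect(w1, w2)))
    only1 = sort(collect(setdiff(w1, w2)))
    only2 = sort(collect(setdiff(w2, w1)))
    return common, only1, only2
end

token_sort("mariners vs angels") == token_sort("angels vs mariners")   # true

common, only1, only2 = token_split("mariners vs angels",
    "los angeles angels of anaheim at seattle mariners")
```

`TokenSet` then compares the joined shared words against the shared words followed by each remainder and keeps the highest score, so a long string that contains every word of a short one is not penalized for its extra words.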
--- a/src/modifiers/winkler.jl
+++ b/src/modifiers/winkler.jl
@@ -7,18 +7,18 @@
 type Winkler{T1 <: PreMetric, T2 <: Real, T3 <: Real} <: PreMetric
    dist::T1
    scaling_factor::T2      # scaling factor. Default to 0.1
-    boosting_limit::T3      # boost threshold. Default to 1.0 
+    boosting_limit::T3      # boost threshold. Default to 0.7
 end

 # restrict to distance between 0 and 1
-Winkler(x) = Winkler(x, 0.1, 1.0)
+Winkler(x) = Winkler(x, 0.1, 0.7)

-function evaluate(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
-    distance = evaluate(Normalized(dist.dist), s1, s2, len1, len2)
+function compare(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
+    score = compare(dist.dist, s1, s2, len1, len2)
    l = common_prefix(s1, s2, 4)[1]
    # common prefix adjustment
-    if distance <= dist.boosting_limit
-        distance -= distance * l * dist.scaling_factor
+    if score >= dist.boosting_limit
+        score += l * dist.scaling_factor * (1 - score)
    end
-    return distance
+    return score
 end
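The new `compare` method for `Winkler` applies the boost to a similarity score, not to a distance. A standalone sketch of that arithmetic, assuming Julia 1.x; `common_prefix_length` and `winkler_boost` are hypothetical names (the package's own `common_prefix` helper is not shown in this diff), and the score passed in is the Jaro similarity quoted in the README.

```julia
# Number of leading characters two strings share, capped at `maxlen`.
function common_prefix_length(s1::AbstractString, s2::AbstractString, maxlen::Integer = 4)
    n = 0
    for (c1, c2) in zip(s1, s2)
        (c1 == c2 && n < maxlen) || break
        n += 1
    end
    return n
end

# Scores below `boosting_limit` are returned unchanged; otherwise the score
# moves toward 1 by l * scaling_factor of the remaining gap (1 - score).
function winkler_boost(score, s1, s2; scaling_factor = 0.1, boosting_limit = 0.7)
    score < boosting_limit && return score
    l = common_prefix_length(s1, s2)
    return score + l * scaling_factor * (1 - score)
end

# "martha" and "marhta" share the prefix "mar", so l = 3.
winkler_boost(0.9444444444444445, "martha", "marhta")
```

With a prefix of length 3 the boost closes 30% of the gap to 1, which takes the Jaro score of about 0.944 to about 0.961, the value shown for `Winkler(Jaro())` in the README hunk.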
--- a/src/normalized.jl
+++ b/src/normalized.jl
@@ -1,30 +0,0 @@
-##############################################################################
-##
-## Normalized
-##
-##############################################################################
-
-type Normalized{T <: PreMetric} <: PreMetric
-	dist::T
-end
-
-function evaluate(normalized::Normalized, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
-    evaluate(normalized.dist, s1, s2, len1, len2)
-end
-
-function evaluate{T <: Union{Hamming, Levenshtein, DamerauLevenshtein}}(
-	normalized::Normalized{T}, s1::AbstractString, s2::AbstractString,
-    len1::Integer, len2::Integer)
-    distance = evaluate(normalized.dist, s1, s2, len1, len2)
-    return distance / len2
-end
-
-function evaluate{T <: QGram}(normalized::Normalized{T}, s1::AbstractString, s2::AbstractString, 
-    len1::Integer, len2::Integer)
-    distance = evaluate(normalized.dist, s1, s2, len1, len2)
-    if len1 <= (normalized.dist.q - 1)
-    	return s1 == s2 ? 0.0 : 1.0
-    else 
-    	return distance / (len1 + len2 - 2 * normalized.dist.q + 2)
-    end
-end
--- a/test/distances.jl
+++ b/test/distances.jl
@@ -2,15 +2,6 @@
 using StringDistances, Base.Test


-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "martha", "marhta") 1 - 0.9611 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dwayne", "duane") 1 - 0.84 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dixon", "dicksonx") 1 - 0.81333 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "william", "williams") 1 - 0.975 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "", "foo") 1.0 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "a", "a") 0.0 1e-4
-@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "abc", "xyz") 1.0 1e-4
-
-
 @test evaluate(Levenshtein(), "", "") == 0
 @test evaluate(Levenshtein(), "abc", "") == 3
 @test evaluate(Levenshtein(), "", "abc") == 3
@@ -43,24 +34,10 @@ using StringDistances, Base.Test
 @test evaluate(Hamming(), "saturday", "sunday") == 7


-
-@test_approx_eq_eps evaluate(Normalized(Hamming()), "", "abc") 1.0 1e-4
-@test_approx_eq_eps evaluate(Normalized(Hamming()), "acc", "abc") 1/3 1e-4
-@test_approx_eq_eps evaluate(Normalized(Hamming()), "saturday", "sunday") 7/8 1e-4
-
-
-
-
-
-
 @test evaluate(QGram(1), "", "abc") == 3
 @test evaluate(QGram(1), "abc", "cba") == 0
 @test evaluate(QGram(1), "abc", "ccc") == 4

-@test_approx_eq_eps evaluate(Normalized(QGram(1)), "", "abc") 1.0 1e-4
-@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "cba") 0.0 1e-4
-@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "ccc") 2/3 1e-4
-
 @test_approx_eq_eps evaluate(Cosine(2), "", "abc") 1 1e-4
 @test_approx_eq_eps evaluate(Cosine(2), "abc", "ccc") 1 1e-4
 @test_approx_eq_eps evaluate(Cosine(2), "leia", "leela") 0.7113249 1e-4
@@ -71,17 +48,6 @@ using StringDistances, Base.Test
 @test_approx_eq_eps evaluate(Jaccard(2), "leia", "leela") 0.83333 1e-4


-
-
-
-
-
-
-
-
-
-
-
 strings = [
 ("martha", "marhta"),
 ("dwayne", "duane") ,
@@ -106,7 +72,6 @@ strings = [
 for x in ((Levenshtein(), [2  2  4  1  3  0  3  2  3  3  4  6 17  3  3  2]),
 		(DamerauLevenshtein(), [1  2  4  1  3  0  3  2  3  3  4  6 17  2  2  2]),
 		(Jaro(), [0.05555556 0.17777778 0.23333333 0.04166667 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.24722222 0.16190476 0.48809524 0.49166667 0.07407407 0.16666667 0.21666667]),
-		(Winkler(Jaro(), 0.1, 1.0), [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]),
 		(QGram(1), [0   3   3   1 3  0   6   4   5   4   4  11  14   0   0   3]),
 		(QGram(2), [  6   7   7   1 2 0   4   4   7   8   4  13  32   8   6   5]),
 		(Jaccard(1), [0.0000000 0.4285714 0.3750000 0.1666667       1.0 0.0000000 1.0000000 0.6666667 0.5714286 0.3750000 0.2000000 0.8333333 0.5000000 0.0000000 0.0000000 0.2500000]),
@@ -145,3 +110,28 @@ stringdist(strings[1,], strings[2,], method = "jw", p = 0.1)
 stringdist(strings[1,], strings[2,], method = "qgram", q = 1)

 =#
+
+
+
+@test matching_blocks("martha", "marhta") == Set([(1,1,3)
+(4,5,1)
+(6,6,1)
+])
+@test matching_blocks("dwayne", "duane") ==
+Set([(5,4,2)
+(1,1,1)
+(3,3,1)])
+@test matching_blocks("dixon", "dicksonx") ==
+Set([(1,1,2)
+ (4,6,2)
+ ])
+
+
+@test_approx_eq evaluate(RatcliffObershelp(), "dixon", "dicksonx") 1 - 0.6153846153846154
+@test_approx_eq evaluate(RatcliffObershelp(), "alexandre", "aleksander") 1 - 0.7368421052631579
+@test_approx_eq evaluate(RatcliffObershelp(), "pennsylvania",  "pencilvaneya") 1 - 0.6666666666666
+@test_approx_eq evaluate(RatcliffObershelp(), "",  "pencilvaneya") 1.0
+@test_approx_eq evaluate(RatcliffObershelp(),"NEW YORK METS", "NEW YORK MEATS") 1 -  0.962962962963
+@test_approx_eq evaluate(RatcliffObershelp(), "Yankees",  "New York Yankees") 0.3913043478260869
+@test_approx_eq evaluate(RatcliffObershelp(), "New York Mets",  "New York Yankees") 0.24137931034482762
+
--- a/test/modifiers.jl
+++ b/test/modifiers.jl
@@ -0,0 +1,62 @@
+
+using StringDistances, Base.Test
+
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "martha", "marhta") 0.9611 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dwayne", "duane") 0.84 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dixon", "dicksonx") 0.81333 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "william", "williams") 0.975 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "", "foo") 0.0 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "a", "a") 1.0 1e-4
+@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "abc", "xyz") 0.0 1e-4
+
+strings = [
+("martha", "marhta"),
+("dwayne", "duane") ,
+("dixon", "dicksonx"),
+("william", "williams"),
+("", "foo"),
+("a", "a"),
+("abc", "xyz"),
+("abc", "ccc"),
+("kitten", "sitting"),
+("saturday", "sunday"),
+("hi, my name is", "my name is"),
+("alborgów", "amoniak"),
+("cape sand recycling ", "edith ann graham"),
+( "jellyifhs", "jellyfish"),
+("ifhs", "fish"),
+("leia", "leela"),
+]
+solutions = [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]
+for i in 1:length(solutions)
+	@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), strings[i]...) (1 - solutions[i]) 1e-4
+end
+
+
+
+
+@test_approx_eq_eps compare(Hamming(), "", "abc") 0.0 1e-4
+@test_approx_eq_eps compare(Hamming(), "acc", "abc") 2/3 1e-4
+@test_approx_eq_eps compare(Hamming(), "saturday", "sunday") 1/8 1e-4
+
+@test_approx_eq_eps compare(QGram(1), "", "abc") 0.0 1e-4
+@test_approx_eq_eps compare(QGram(1), "abc", "cba") 1.0 1e-4
+@test_approx_eq_eps compare(QGram(1), "abc", "ccc") 1/3 1e-4
+
+
+@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees",  "Yankees") 1.0
+@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees",  "") 0.0
+
+
+@test_approx_eq compare(Partial(Hamming()), "New York Yankees",  "Yankees") 1
+@test_approx_eq compare(Partial(Hamming()), "New York Yankees",  "") 1
+
+
+
+
+@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")  1.0
+@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners") 1.0 - 0.09090909090909094
+
+
+@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "")  0.0
+@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "") 0.0
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -1,7 +1,6 @@
 using StringDistances

-tests = ["distances.jl"
-		 ]
+tests = ["distances.jl", "modifiers.jl"]

 println("Running tests:")