redefine modifiers
parent bd9c7fba24
commit 730a513d8e

README.md (34 changed lines)
@@ -12,7 +12,7 @@ The available distances are:
 - Edit Distances
   - Hamming Distance `Hamming()`
-  - [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
+  - [Jaro and Jaro-Winkler Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()` `JaroWinkler()`
   - [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
   - [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
   - [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
@@ -24,13 +24,13 @@ The available distances are:
   - [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int)`
   - [MorisitaOverlap Distance](https://en.wikipedia.org/wiki/Morisita%27s_overlap_index) `MorisitaOverlap(q::Int)`
   - [Normalized Multiset Distance](https://www.sciencedirect.com/science/article/pii/S1047320313001417) `NMD(q::Int)`
-- Distance "modifiers" that can be applied to any distance:
-  - [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the normalized distance between the shorter string and substrings of the longer string.
-  - [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders by returning the normalized distance of the two strings, after re-ordering words alphabetically.
-  - [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by returning the normalized distance between the intersection of two strings with each string.
-  - [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines the normalized distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
-  - [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) diminishes the normalized distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
+The package also defines distance "modifiers" that can be applied to any distance:
+- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the distance between the shorter string and substrings of the longer string.
+- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by returning the distance of the two strings after re-ordering their words alphabetically.
+- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by returning the distance between the intersection of the two strings and each string.
+- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) normalizes the distance and combines the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
 
 ## Basic Use
 
 ### evaluate
@@ -49,36 +49,36 @@ Levenshtein()("martha", "marhta")
 ```
 
 ### pairwise
-`pairwise` returns the matrix of distance between two `AbstractVectors`
+`pairwise` returns the matrix of distances between two `AbstractVector`s of `AbstractString`s
 
 ```julia
 pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
 ```
 It is particularly fast for QGram-distances (each element is processed once).
 
-### compare
-The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0.0 and 1.0: a value of 0 means completely different and a value of 1 means completely similar.
+### compare and find
+The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64`: a value of 0.0 means completely different and a value of 1.0 means completely similar.
 
 ```julia
-evaluate(Levenshtein(), "martha", "martha")
+Levenshtein()("martha", "martha")
 #> 0.0
 compare("martha", "martha", Levenshtein())
 #> 1.0
 ```
 
-### find
-- `findnearest` returns the value and index of the element in `itr` with the lowest distance with `s`. Its syntax is:
+`findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
 ```julia
-findnearest(s, itr, dist::StringDistance; min_score = 0.0)
+findnearest(s, itr, dist::StringDistance)
 ```
 
-- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (default to 0.8). Its syntax is:
+`findall` returns the indices of all elements in `itr` whose similarity score with `s` is higher than a minimum value (defaults to 0.8). Its syntax is:
 ```julia
 findall(s, itr, dist::StringDistance; min_score = 0.8)
 ```
 
 The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
 
 ## References
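For readers following along without the package installed, the `compare` convention in the README above (one minus the distance, normalized by the longer string's length) can be sketched in plain Julia. `toy_levenshtein` and `toy_compare` are illustrative stand-ins, not the package's optimized implementations:

```julia
# Textbook dynamic-programming Levenshtein distance (illustration only).
function toy_levenshtein(s1, s2)
    a, b = collect(s1), collect(s2)
    m, n = length(a), length(b)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] = 0:m                      # deleting i characters costs i
    d[1, :] = 0:n                      # inserting j characters costs j
    for i in 1:m, j in 1:n
        cost = a[i] == b[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost)
    end
    return d[m + 1, n + 1]
end

# compare-style score: 1 minus the distance normalized by the longer length.
toy_compare(s1, s2) = 1 - toy_levenshtein(s1, s2) / max(length(s1), length(s2), 1)

toy_levenshtein("martha", "marhta")  # 2 (two substitutions)
toy_compare("martha", "marhta")      # ≈ 0.6667, the value quoted in the compare docstring
```
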
@@ -5,16 +5,17 @@ using Distances
 include("distances/utils.jl")
 include("distances/edit.jl")
 include("distances/qgram.jl")
+include("modifiers.jl")
 include("normalize.jl")
 
-const StringDistance = Union{Hamming, Jaro, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Winkler, Partial, TokenSort, TokenSet, TokenMax, Normalize}
 include("find.jl")
 include("pairwise.jl")
+
+# Distances API
+Distances.result_type(dist::StringDistance, s1::Type, s2::Type) = typeof(dist("", ""))
+Distances.result_type(dist::StringDistance, s1, s2) = result_type(dist, typeof(s1), typeof(s2))
 
 ##############################################################################
 ##
@@ -28,18 +29,18 @@ Hamming,
 Levenshtein,
 DamerauLevenshtein,
 Jaro,
+JaroWinkler,
 RatcliffObershelp,
 QGramDistance,
 QGram,
+QGramDict,
+QGramSortedVector,
 Cosine,
 Jaccard,
 SorensenDice,
 Overlap,
 MorisitaOverlap,
 NMD,
-QGramDict,
-QGramSortedVector,
 Winkler,
 Partial,
 TokenSort,
 TokenSet,
@@ -12,13 +12,13 @@ Hamming() = Hamming(nothing)
 
 function (dist::Hamming)(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
-    current = abs(length(s2) - length(s1))
-    dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
+    out = abs(length(s2) - length(s1))
+    dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
     for (ch1, ch2) in zip(s1, s2)
-        current += ch1 != ch2
-        dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
+        out += ch1 != ch2
+        dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
    end
-    return current
+    return out
 end
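The `current` → `out` rename in the hunk above leaves the early-exit logic unchanged; that logic can be sketched standalone (`capped_hamming` is a hypothetical free function mirroring the method, not the package's `Hamming`):

```julia
# Mismatch count with a length-difference penalty and an optional cap:
# once the running count exceeds max_dist, bail out with max_dist + 1.
function capped_hamming(s1, s2; max_dist = nothing)
    out = abs(length(s2) - length(s1))
    max_dist !== nothing && out > max_dist && return max_dist + 1
    for (ch1, ch2) in zip(s1, s2)
        out += ch1 != ch2
        max_dist !== nothing && out > max_dist && return max_dist + 1
    end
    return out
end

capped_hamming("martha", "marhta")                 # 2
capped_hamming("abcdef", "uvwxyz", max_dist = 2)   # 3 (= max_dist + 1, early exit)
```

The early exit is what makes the capped version cheap on long, clearly different strings: the loop stops as soon as the budget is exhausted.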
@@ -73,6 +73,37 @@ function (dist::Jaro)(s1, s2)
     return 1.0 - (m / len1 + m / len2 + (m - t/2) / m) / 3.0
 end
 
+"""
+    JaroWinkler(; p = 0.1, threshold = 0.3, maxlength = 4)
+
+Creates the JaroWinkler distance.
+
+The JaroWinkler distance is defined as the Jaro distance multiplied by
+``(1 - min(l, maxlength) * p)`` as long as it is lower than `threshold`, where `l` denotes the length of the common prefix.
+"""
+struct JaroWinkler <: SemiMetric
+    p::Float64          # scaling factor. Defaults to 0.1
+    threshold::Float64  # boost limit. Defaults to 0.3
+    maxlength::Integer  # max length of common prefix. Defaults to 4
+end
+
+JaroWinkler(; p = 0.1, threshold = 0.3, maxlength = 4) = JaroWinkler(p, threshold, maxlength)
+
+## http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html
+function (dist::JaroWinkler)(s1, s2)
+    ((s1 === missing) | (s2 === missing)) && return missing
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    out = Jaro()(s1, s2)
+    if out <= dist.threshold
+        l = common_prefix(s1, s2)[1]
+        out = (1 - min(l, dist.maxlength) * dist.p) * out
+    end
+    return out
+end
+
 """
     Levenshtein()
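The prefix adjustment in the new `JaroWinkler` method is plain arithmetic and can be checked in isolation (`winkler_adjust` is a hypothetical helper for illustration, not part of the package):

```julia
# A Jaro-style distance d is multiplied by (1 - min(l, maxlength) * p)
# when it is already small enough (d <= threshold); l = common-prefix length.
function winkler_adjust(d; l = 0, p = 0.1, threshold = 0.3, maxlength = 4)
    d <= threshold ? (1 - min(l, maxlength) * p) * d : d
end

winkler_adjust(0.2, l = 3)  # ≈ 0.14: shrunk thanks to the shared prefix
winkler_adjust(0.5, l = 3)  # 0.5: above the threshold, left unchanged
```

Note the asymmetry: only already-close strings get the prefix bonus, which is why the method tests `out <= dist.threshold` before touching the score.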
src/find.jl (13 changed lines)

@@ -1,3 +1,5 @@
+const StringDistance = Union{Hamming, Jaro, JaroWinkler, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Partial, TokenSort, TokenSet, TokenMax, Normalized}
+
 """
     compare(s1, s2, dist)
 
@@ -10,16 +12,15 @@ julia> compare("martha", "marhta", Levenshtein())
 0.6666666666666667
 ```
 """
-compare(s1, s2, dist::StringDistance; min_score = 0.0) = 1 - normalize(dist)(s1, s2, 1 - min_score)
+function compare(s1, s2, dist::StringDistance; min_score = 0.0)
+    1 - normalize(dist, max_dist = 1 - min_score)(s1, s2)
+end
 
 """
-    findnearest(s, itr, dist::StringDistance; min_score = 0.0) -> (x, index)
+    findnearest(s, itr, dist::StringDistance) -> (x, index)
 
 `findnearest` returns the value and index of the element of `itr` that has the
-highest similarity score with `s` according to the distance `dist`.
-It returns `(nothing, nothing)` if none of the elements has a similarity score
-higher or equal to `min_score` (default to 0.0).
+lowest distance with `s` according to the distance `dist`.
 
 It is particularly optimized for [`Levenshtein`](@ref) and [`DamerauLevenshtein`](@ref) distances
 (as well as their modifications via [`Partial`](@ref), [`TokenSort`](@ref), [`TokenSet`](@ref), or [`TokenMax`](@ref)).
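The docstring's contract (best element plus its index, `(nothing, nothing)` when nothing qualifies) can be sketched with a toy scorer; `toy_findnearest` and `toy_score` are hypothetical illustrations, not the package's optimized implementation:

```julia
# Linear scan keeping the best-scoring element and its index.
function toy_findnearest(score, s, itr)
    best, besti, bestscore = nothing, nothing, -Inf
    for (i, x) in enumerate(itr)
        sc = score(s, x)
        if sc > bestscore
            best, besti, bestscore = x, i, sc
        end
    end
    return best, besti  # (nothing, nothing) if itr was empty
end

# Toy similarity: number of distinct shared characters.
toy_score(a, b) = length(intersect(Set(a), Set(b)))

toy_findnearest(toy_score, "New York", ["NewYork", "Newark", "San Francisco"])
# ("NewYork", 1)
```

The package versions avoid this naive full scan for `Levenshtein`-family distances by pruning with the best score found so far.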
src/modifiers.jl (new file)

@@ -0,0 +1,121 @@
+"""
+    Partial(dist)
+
+Creates the `Partial{dist}` distance.
+
+`Partial{dist}` returns the minimum distance between the shorter string and substrings of the longer string (of the size of the shorter string).
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta Braves"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
+0.5483870967741935
+```
+"""
+struct Partial{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+function (dist::Partial)(s1, s2)
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    out = dist.dist(s1, s2)
+    ((len1 == 0) | (len1 == len2)) && return out
+    for x in qgrams(s2, len1)
+        curr = dist.dist(s1, x)
+        out = min(out, curr)
+    end
+    return out
+end
+
+function (dist::Partial{RatcliffObershelp})(s1, s2)
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    len1 == len2 && return dist.dist(s1, s2)
+    out = 1.0
+    for r in matching_blocks(s1, s2)
+        # Make sure the substring of s2 has length len1
+        s2_start = r[2] - r[1] + 1
+        s2_end = s2_start + len1 - 1
+        if s2_start < 1
+            s2_end += 1 - s2_start
+            s2_start += 1 - s2_start
+        elseif s2_end > len2
+            s2_start += len2 - s2_end
+            s2_end += len2 - s2_end
+        end
+        curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
+        out = min(out, curr)
+    end
+    return out
+end
+
+"""
+    TokenSort(dist)
+
+Creates the `TokenSort{dist}` distance.
+
+`TokenSort{dist}` returns the distance between strings after re-ordering their words alphabetically.
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta Braves"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
+0.0
+```
+"""
+struct TokenSort{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+function (dist::TokenSort)(s1::AbstractString, s2::AbstractString)
+    s1 = join(sort!(split(s1)), " ")
+    s2 = join(sort!(split(s2)), " ")
+    out = dist.dist(s1, s2)
+end
+
+"""
+    TokenSet(dist)
+
+Creates the `TokenSet{dist}` distance.
+
+`TokenSet{dist}` compares the intersection of two strings with each string, after re-ordering words alphabetically.
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
+0.0
+```
+"""
+struct TokenSet{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+function (dist::TokenSet)(s1::AbstractString, s2::AbstractString)
+    v1 = unique!(sort!(split(s1)))
+    v2 = unique!(sort!(split(s2)))
+    v0 = intersect(v1, v2)
+    s0 = join(v0, " ")
+    s1 = join(v1, " ")
+    s2 = join(v2, " ")
+    isempty(s0) && return dist.dist(s1, s2)
+    score_01 = dist.dist(s0, s1)
+    score_02 = dist.dist(s0, s2)
+    score_12 = dist.dist(s1, s2)
+    min(score_01, score_02, score_12)
+end
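The word-level transformations these modifiers apply are easy to see in isolation; `token_sorted` and `token_set` below are hypothetical helpers mirroring the bodies above, on which the wrapped distance is then evaluated:

```julia
# TokenSort's preprocessing: words sorted alphabetically, rejoined.
token_sorted(s) = join(sort!(split(s)), " ")

# TokenSet's preprocessing: sorted, deduplicated word vectors, whose
# intersection is then compared against each full string.
token_set(s) = unique!(sort!(split(s)))

token_sorted("New York Mets vs Atlanta Braves")
# "Atlanta Braves Mets New York vs"
```

Both example strings from the docstrings reduce to the same sorted form, which is why `TokenSort(RatcliffObershelp())` reports a distance of 0.0 for them.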
src/normalize.jl (233 changed lines)

@@ -1,41 +1,34 @@
-struct Normalize{S <: SemiMetric} <: SemiMetric
-    dist::S
+struct Normalized{V <: SemiMetric} <: SemiMetric
+    dist::V
+    max_dist::Float64
 end
 
-"""
-    normalize(dist::SemiMetric)
-
-Normalize a metric, so that `evaluate` always return a Float64 between 0 and 1
-"""
-normalize(dist::SemiMetric, max_dist = 1.0) = Normalize(dist)
-normalize(dist::Normalize, max_dist = 1.0) = Normalize(dist.dist)
-
-function (dist::Normalize{<:Hamming})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:Hamming})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
     len2 == 0 && return 1.0
     out = dist.dist(s1, s2) / len2
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-# A normalized distance is between 0 and 1, and accept a third argument, max_dist.
-function (dist::Normalize{<: Union{Levenshtein, DamerauLevenshtein}})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:Union{Levenshtein{Nothing}, DamerauLevenshtein{Nothing}}})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
     len2 == 0 && return 1.0
     if dist.dist isa Levenshtein
-        d = Levenshtein(ceil(Int, len2 * max_dist))(s1, s2)
+        d = Levenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
     else
-        d = DamerauLevenshtein(ceil(Int, len2 * max_dist))(s1, s2)
+        d = DamerauLevenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
    end
    out = d / len2
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:QGramDistance})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     # When string length < q for qgram distance, returns s1 == s2
     s1, s2 = reorder(s1, s2)
@@ -46,143 +39,22 @@ function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
     else
         out = dist.dist(s1, s2)
     end
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-function (dist::Normalize)(s1, s2, max_dist = 1.0)
+function (dist::Normalized)(s1, s2)
     out = dist.dist(s1, s2)
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
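The clamping rule these `Normalized` methods share is that any normalized distance beyond `max_dist` collapses to 1.0; `compare` in find.jl exploits it by setting `max_dist = 1 - min_score`. A standalone sketch (`toy_compare_with_cutoff` is a hypothetical helper operating on an already-normalized distance):

```julia
# A normalized distance above max_dist = 1 - min_score is clamped to 1.0,
# so the resulting similarity score bottoms out at 0.0 below min_score.
toy_compare_with_cutoff(ndist, min_score) = 1 - (ndist > 1 - min_score ? 1.0 : ndist)

toy_compare_with_cutoff(0.2, 0.5)  # 0.8: distance within the cap, exact score
toy_compare_with_cutoff(0.7, 0.5)  # 0.0: distance beyond the cap, clamped
```

The clamp is what lets the `Levenshtein{Nothing}` method above cap its edit budget at `ceil(Int, len2 * dist.max_dist)` without changing any reported score above the cutoff.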
-"""
-    Partial(dist)
-
-Creates the `Partial{dist}` distance.
-
-`Partial{dist}` normalizes the string distance `dist` and modify it to return the
-minimum distance between the shorter string and substrings of the longer string
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta Braves"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
-0.5483870967741935
-```
-"""
-struct Partial{S <: SemiMetric} <: SemiMetric
-    dist::S
-    Partial{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-Partial(dist::SemiMetric) = Partial{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::Partial) = dist
-
-function (dist::Partial)(s1, s2, max_dist = 1.0)
-    s1, s2 = reorder(s1, s2)
-    len1, len2 = length(s1), length(s2)
-    out = dist.dist(s1, s2, max_dist)
-    len1 == len2 && return out
-    len1 == 0 && return out
-    for x in qgrams(s2, len1)
-        curr = dist.dist(s1, x, max_dist)
-        out = min(out, curr)
-        max_dist = min(out, max_dist)
-    end
-    return out
-end
-
-function (dist::Partial{Normalize{RatcliffObershelp}})(s1, s2, max_dist = 1.0)
-    s1, s2 = reorder(s1, s2)
-    len1, len2 = length(s1), length(s2)
-    len1 == len2 && return dist.dist(s1, s2)
-    out = 1.0
-    for r in matching_blocks(s1, s2)
-        # Make sure the substring of s2 has length len1
-        s2_start = r[2] - r[1] + 1
-        s2_end = s2_start + len1 - 1
-        if s2_start < 1
-            s2_end += 1 - s2_start
-            s2_start += 1 - s2_start
-        elseif s2_end > len2
-            s2_start += len2 - s2_end
-            s2_end += len2 - s2_end
-        end
-        curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
-        out = min(out, curr)
-    end
-    return out
-end
-
-"""
-    TokenSort(dist)
-
-Creates the `TokenSort{dist}` distance.
-
-`TokenSort{dist}` normalizes the string distance `dist` and modify it to adjust for differences
-in word orders by reording words alphabetically.
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta Braves"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
-0.0
-```
-"""
-struct TokenSort{S <: SemiMetric} <: SemiMetric
-    dist::S
-    TokenSort{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-TokenSort(dist::SemiMetric) = TokenSort{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenSort) = dist
-
-# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-function (dist::TokenSort)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
-    s1 = join(sort!(split(s1)), " ")
-    s2 = join(sort!(split(s2)), " ")
-    out = dist.dist(s1, s2, max_dist)
-end
-
-"""
-    TokenSet(dist)
-
-Creates the `TokenSet{dist}` distance.
-
-`TokenSet{dist}` normalizes the string distance `dist` and modify it to adjust for differences
-in word orders and word numbers by comparing the intersection of two strings with each string.
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
-0.0
-```
-"""
-struct TokenSet{S <: SemiMetric} <: SemiMetric
-    dist::S
-    TokenSet{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-TokenSet(dist::SemiMetric) = TokenSet{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenSet) = dist
-
-# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-function (dist::TokenSet)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
-    v1 = unique!(sort!(split(s1)))
-    v2 = unique!(sort!(split(s2)))
-    v0 = intersect(v1, v2)
-    s0 = join(v0, " ")
-    s1 = join(v1, " ")
-    s2 = join(v2, " ")
-    isempty(s0) && return dist.dist(s1, s2, max_dist)
-    score_01 = dist.dist(s0, s1, max_dist)
-    max_dist = min(max_dist, score_01)
-    score_02 = dist.dist(s0, s2, max_dist)
-    max_dist = min(max_dist, score_02)
-    score_12 = dist.dist(s1, s2, max_dist)
-    min(score_01, score_02, score_12)
-end
+normalize(dist::SemiMetric; max_dist = 1.0) = Normalized{typeof(dist)}(dist, max_dist)
+normalize(dist::Union{Jaro, JaroWinkler}; max_dist = 1.0) = dist
+normalize(dist::Partial; max_dist = 1.0) = Partial(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::TokenSort; max_dist = 1.0) = TokenSort(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::TokenSet; max_dist = 1.0) = TokenSet(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::Normalized; max_dist = 1.0) = Normalized{typeof(dist.dist)}(dist.dist, max_dist)
 
 """
     TokenMax(dist)
@@ -207,69 +79,38 @@ struct TokenMax{S <: SemiMetric} <: SemiMetric
 end
 
 TokenMax(dist::SemiMetric) = TokenMax{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenMax) = dist
+function normalize(dist::TokenMax; max_dist = 1.0)
+    dist = normalize(dist.dist; max_dist = max_dist)
+    TokenMax{typeof(dist)}(dist)
+end
 
-function (dist::TokenMax)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
+function (dist::TokenMax)(s1::AbstractString, s2::AbstractString)
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
-    score = dist.dist(s1, s2, max_dist)
+    _dist = deepcopy(dist.dist)
+    max_dist = _dist.max_dist
+    score = _dist(s1, s2)
     min_score = min(max_dist, score)
     unbase_scale = 0.95
     # if one string is much shorter than the other, use partial
     if length(s2) >= 1.5 * length(s1)
-        partial_dist = Partial(dist.dist)
         partial_scale = length(s2) > (8 * length(s1)) ? 0.6 : 0.9
-        score_partial = 1 - partial_scale * (1 - partial_dist(s1, s2, 1 - (1 - max_dist) / partial_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / partial_scale)
+        score_partial = 1 - partial_scale * (1 - Partial(_dist)(s1, s2))
        min_score = min(max_dist, score_partial)
-        score_sort = 1 - unbase_scale * partial_scale *
-            (1 - TokenSort(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
+        score_sort = 1 - unbase_scale * partial_scale * (1 - TokenSort(Partial(_dist))(s1, s2))
        max_dist = min(max_dist, score_sort)
-        score_set = 1 - unbase_scale * partial_scale *
-            (1 - TokenSet(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
+        score_set = 1 - unbase_scale * partial_scale * (1 - TokenSet(Partial(_dist))(s1, s2))
        out = min(score, score_partial, score_sort, score_set)
    else
-        score_sort = 1 - unbase_scale *
-            (1 - TokenSort(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
+        score_sort = 1 - unbase_scale * (1 - TokenSort(_dist)(s1, s2))
        max_dist = min(max_dist, score_sort)
-        score_set = 1 - unbase_scale *
-            (1 - TokenSet(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
+        score_set = 1 - unbase_scale * (1 - TokenSet(_dist)(s1, s2))
        out = min(score, score_sort, score_set)
    end
    out > max_dist ? 1.0 : out
 end
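The repeated `1 - scale * (1 - d)` pattern in `TokenMax` is a similarity-space damping: it converts the distance `d` to a similarity, shrinks that by `scale`, and converts back. A sketch (`rescale` is a hypothetical name for this pattern, used here only for illustration):

```julia
# Damp a normalized distance d in similarity space:
# the similarity (1 - d) is multiplied by scale, then mapped back to a distance.
rescale(d, scale) = 1 - scale * (1 - d)

rescale(0.0, 0.95)  # a perfect sub-match still incurs a small distance
rescale(1.0, 0.95)  # a complete mismatch stays at 1.0
```

This is why the `unbase_scale` and `partial_scale` factors act as penalty terms: a sub-score obtained via `TokenSort`/`TokenSet`/`Partial` can never beat a direct match of the same quality.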
 """
     Winkler(dist; p::Real = 0.1, threshold::Real = 0.7, maxlength::Integer = 4)
 
 Creates the `Winkler{dist, p, threshold, maxlength}` distance.
 
 `Winkler{dist, p, threshold, maxlength}` normalizes the string distance `dist` and modifies it to decrease the
 distance between two strings when their original distance is below some `threshold`.
 The boost is equal to `min(l, maxlength) * p * dist`, where `l` denotes the
 length of their common prefix and `dist` denotes the original distance.
 """
 struct Winkler{S <: SemiMetric} <: SemiMetric
     dist::S
     p::Float64          # scaling factor. Defaults to 0.1
     threshold::Float64  # boost threshold. Defaults to 0.7
     maxlength::Integer  # max length of common prefix. Defaults to 4
     Winkler{S}(dist::S, p, threshold, maxlength) where {S <: SemiMetric} = new(dist, p, threshold, maxlength)
 end
 
 function Winkler(dist::SemiMetric; p = 0.1, threshold = 0.7, maxlength = 4)
     p * maxlength <= 1 || throw("scaling factor times maxlength of common prefix must be lower than one")
     dist = normalize(dist)
     Winkler{typeof(dist)}(dist, p, threshold, maxlength)
 end
 normalize(dist::Winkler) = dist
 
 function (dist::Winkler)(s1, s2, max_dist = 1.0)
     # cannot do max_dist because of boosting threshold
     out = dist.dist(s1, s2)
     if out <= 1 - dist.threshold
         l = common_prefix(s1, s2)[1]
         out -= min(l, dist.maxlength) * dist.p * out
     end
     out > max_dist ? 1.0 : out
 end
@@ -74,7 +74,7 @@ function Distances.pairwise!(R::AbstractMatrix, dist::StringDistance, xs::Abstra
 end
 
 function _preprocess(xs, dist::QGramDistance, preprocess)
-    if (preprocess === true) || (isnothing(preprocess) && length(xs) >= 5)
+    if preprocess === nothing ? length(xs) >= 5 : preprocess
        return map(x -> x === missing ? x : QGramSortedVector(x, dist.q), xs)
    else
        return xs
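The rewritten condition reads as: `nothing` means "decide from the collection size", while an explicit `Bool` wins. Sketched with a hypothetical `should_preprocess` helper:

```julia
# nothing     -> heuristic: precompute q-gram counts only for >= 5 elements
# true/false  -> follow the caller's explicit choice
should_preprocess(preprocess, n) = preprocess === nothing ? n >= 5 : preprocess

should_preprocess(nothing, 10)  # true
should_preprocess(false, 10)    # false
```

The two forms are equivalent for `Bool` or `nothing` inputs; the ternary just avoids evaluating `length(xs)` when the caller has already decided.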
@@ -26,13 +26,13 @@ using StringDistances, Unicode, Test
 @test compare("ab", "de", Partial(DamerauLevenshtein())) == 0
 @test normalize(Partial(DamerauLevenshtein()))("ab", "cde") == 1.0
 # Winkler
-@test compare("martha", "marhta", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.9611 atol = 1e-4
-@test compare("dwayne", "duane", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.84 atol = 1e-4
-@test compare("dixon", "dicksonx", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.81333 atol = 1e-4
-@test compare("william", "williams", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.975 atol = 1e-4
-@test compare("", "foo", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.0 atol = 1e-4
-@test compare("a", "a", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 1.0 atol = 1e-4
-@test compare("abc", "xyz", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.0 atol = 1e-4
+@test compare("martha", "marhta", JaroWinkler()) ≈ 0.9611 atol = 1e-4
+@test compare("dwayne", "duane", JaroWinkler()) ≈ 0.84 atol = 1e-4
+@test compare("dixon", "dicksonx", JaroWinkler()) ≈ 0.81333 atol = 1e-4
+@test compare("william", "williams", JaroWinkler()) ≈ 0.975 atol = 1e-4
+@test compare("", "foo", JaroWinkler()) ≈ 0.0 atol = 1e-4
+@test compare("a", "a", JaroWinkler()) ≈ 1.0 atol = 1e-4
+@test compare("abc", "xyz", JaroWinkler()) ≈ 0.0 atol = 1e-4
 
 # RatcliffObershelp
 @test compare("New York Mets vs Atlanta Braves", "", RatcliffObershelp()) ≈ 0.0

@@ -104,9 +104,9 @@ using StringDistances, Unicode, Test
-@test findnearest("New York", ["San Francisco", "NewYork", "Newark"], Levenshtein()) == ("NewYork", 2)
+@test findnearest("New York", ["Newark", "San Francisco", "NewYork"], Levenshtein()) == ("NewYork", 3)
 
 @test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.99) == (nothing, nothing)
 
 @test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Jaro()) == ("NewYork", 1)
-@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], QGram(2)) == ("NewYork", 1)
+@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], normalize(QGram(2))) == ("NewYork", 1)
 
 @test findall("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein()) == [1]