redefine modifiers

pull/44/head
matthieugomez 2020-11-11 21:13:14 -08:00
parent bd9c7fba24
commit 730a513d8e
8 changed files with 236 additions and 241 deletions

@@ -12,7 +12,7 @@ The available distances are:
- Edit Distances
- Hamming Distance `Hamming()`
- [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
- [Jaro and Jaro-Winkler Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()` `JaroWinkler()`
- [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
- [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
@@ -24,13 +24,13 @@ The available distances are:
- [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int)`
- [MorisitaOverlap Distance](https://en.wikipedia.org/wiki/Morisita%27s_overlap_index) `MorisitaOverlap(q::Int)`
- [Normalized Multiset Distance](https://www.sciencedirect.com/science/article/pii/S1047320313001417) `NMD(q::Int)`
- Distance "modifiers" that can be applied to any distance:
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the normalized distance between the shorter string and substrings of the longer string.
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders by returning the normalized distance of the two strings, after re-ordering words alphabetically.
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by returning the normalized distance between the intersection of two strings with each string.
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines the normalized distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) diminishes the normalized distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
The package also defines distance "modifiers" that can be applied to any distance (see the sketch after this list):
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum distance between the shorter string and substrings of the longer string.
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by returning the distance between the two strings after re-ordering their words alphabetically.
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by comparing the intersection of the two strings with each string.
- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) normalizes the distance and combines the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance for matching strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
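A minimal sketch of these modifiers in action (the output values are taken from the docstring examples in `src/modifiers.jl` below; with this commit, modifiers are callable like any other distance):
```julia
using StringDistances
s1 = "New York Mets vs Atlanta Braves"
s2 = "Atlanta Braves vs New York Mets"
Partial(RatcliffObershelp())(s1, s2)    # best-aligned substring still differs
#> 0.5483870967741935
TokenSort(RatcliffObershelp())(s1, s2)  # identical once words are sorted
#> 0.0
```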
## Basic Use
### evaluate
@@ -49,36 +49,36 @@ Levenshtein()("martha", "marhta")
```
### pairwise
`pairwise` returns the matrix of distance between two `AbstractVectors`
`pairwise` returns the matrix of distances between two `AbstractVector`s of `AbstractString`s
```julia
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
```
It is particularly fast for QGram-distances (each element is processed once).
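The q-gram counts for each string can be precomputed. A hedged sketch: the `preprocess` keyword below is an assumption based on the `_preprocess` helper touched in `src/pairwise.jl` in this commit (by default, precomputation kicks in for 5 or more strings):
```julia
xs = ["martha", "marhta", "kitten", "sitting", "mitten"]
# preprocess = true (assumed keyword) counts each string once into a
# QGramSortedVector before the pairwise loop
pairwise(Jaccard(2), xs, xs; preprocess = true)
```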
### compare
The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0.0 and 1.0: a value of 0 means completely different and a value of 1 means completely similar.
### compare and find
The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.
```julia
evaluate(Levenshtein(), "martha", "martha")
Levenshtein()("martha", "martha")
#> 0.0
compare("martha", "martha", Levenshtein())
#> 1.0
```
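Per the new `compare` definition in `src/find.jl`, the `min_score` keyword normalizes the distance with `max_dist = 1 - min_score`, so a pair that cannot reach `min_score` saturates to a similarity of 0.0. A minimal sketch (the output follows from that saturation rule, not from a test in this commit):
```julia
compare("martha", "marhta", Levenshtein(); min_score = 0.99)
#> 0.0
```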
### find
- `findnearest` returns the value and index of the element in `itr` with the lowest distance with `s`. Its syntax is:
`findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
```julia
findnearest(s, itr, dist::StringDistance; min_score = 0.0)
findnearest(s, itr, dist::StringDistance)
```
- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (default to 0.8). Its syntax is:
`findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (defaults to 0.8). Its syntax is:
```julia
findall(s, itr, dist::StringDistance; min_score = 0.8)
```
The functions `findnearest` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
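For example (both calls and outputs are taken from this commit's test suite):
```julia
findnearest("New York", ["Newark", "San Francisco", "NewYork"], Levenshtein())
#> ("NewYork", 3)
findall("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein())
#> [1]
```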
## References

@@ -5,16 +5,17 @@ using Distances
include("distances/utils.jl")
include("distances/edit.jl")
include("distances/qgram.jl")
include("modifiers.jl")
include("normalize.jl")
const StringDistance = Union{Hamming, Jaro, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Winkler, Partial, TokenSort, TokenSet, TokenMax, Normalize}
include("find.jl")
include("pairwise.jl")
# Distances API
Distances.result_type(dist::StringDistance, s1::Type, s2::Type) = typeof(dist("", ""))
Distances.result_type(dist::StringDistance, s1, s2) = result_type(dist, typeof(s1), typeof(s2))
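# Illustration (not part of the commit): the result type is inferred from a call
# on empty strings, e.g. Distances.result_type(Levenshtein(), String, String)
# is typeof(Levenshtein()("", "")), i.e. Int.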
include("find.jl")
include("pairwise.jl")
##############################################################################
##
@@ -28,18 +29,18 @@ Hamming,
Levenshtein,
DamerauLevenshtein,
Jaro,
JaroWinkler,
RatcliffObershelp,
QGramDistance,
QGram,
QGramDict,
QGramSortedVector,
Cosine,
Jaccard,
SorensenDice,
Overlap,
MorisitaOverlap,
NMD,
QGramDict,
QGramSortedVector,
Winkler,
Partial,
TokenSort,
TokenSet,

@@ -12,13 +12,13 @@ Hamming() = Hamming(nothing)
function (dist::Hamming)(s1, s2)
((s1 === missing) | (s2 === missing)) && return missing
current = abs(length(s2) - length(s1))
dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
out = abs(length(s2) - length(s1))
dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
for (ch1, ch2) in zip(s1, s2)
current += ch1 != ch2
dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
out += ch1 != ch2
dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
end
return current
return out
end
@@ -73,6 +73,37 @@ function (dist::Jaro)(s1, s2)
return 1.0 - (m / len1 + m / len2 + (m - t/2) / m) / 3.0
end
"""
JaroWinkler(;p = 0.1, threshold = 0.3, maxlength = 4)
Creates the JaroWinkler distance.
The JaroWinkler distance is the Jaro distance multiplied by
``(1 - min(l, maxlength) * p)`` whenever the Jaro distance is lower than `threshold`, where `l` denotes the length of the common prefix.
"""
struct JaroWinkler <: SemiMetric
p::Float64 # scaling factor. Default to 0.1
threshold::Float64 # boost limit. Default to 0.3
maxlength::Integer # max length of common prefix. Default to 4
end
JaroWinkler(; p = 0.1, threshold = 0.3, maxlength = 4) = JaroWinkler(p, threshold, maxlength)
## http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html
function (dist::JaroWinkler)(s1, s2)
((s1 === missing) | (s2 === missing)) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
out = Jaro()(s1, s2)
if out <= dist.threshold
l = common_prefix(s1, s2)[1]
out = (1 - min(l, dist.maxlength) * dist.p) * out
end
return out
end
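# Usage sketch (similarity value taken from this commit's tests, where
# compare("martha", "marhta", JaroWinkler()) ≈ 0.9611):
#   JaroWinkler()("martha", "marhta")  # distance ≈ 0.0389, i.e. 1 - 0.9611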
"""
Levenshtein()

@@ -1,3 +1,5 @@
const StringDistance = Union{Hamming, Jaro, JaroWinkler, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Partial, TokenSort, TokenSet, TokenMax, Normalized}
"""
compare(s1, s2, dist)
@@ -10,16 +12,15 @@ julia> compare("martha", "marhta", Levenshtein())
0.6666666666666667
```
"""
compare(s1, s2, dist::StringDistance; min_score = 0.0) = 1 - normalize(dist)(s1, s2, 1 - min_score)
function compare(s1, s2, dist::StringDistance; min_score = 0.0)
1 - normalize(dist, max_dist = 1 - min_score)(s1, s2)
end
"""
findnearest(s, itr, dist::StringDistance; min_score = 0.0) -> (x, index)
findnearest(s, itr, dist::StringDistance) -> (x, index)
`findnearest` returns the value and index of the element of `itr` that has the
highest similarity score with `s` according to the distance `dist`.
It returns `(nothing, nothing)` if none of the elements has a similarity score
higher or equal to `min_score` (default to 0.0).
lowest distance with `s` according to the distance `dist`.
It is particularly optimized for [`Levenshtein`](@ref) and [`DamerauLevenshtein`](@ref) distances
(as well as their modifications via [`Partial`](@ref), [`TokenSort`](@ref), [`TokenSet`](@ref), or [`TokenMax`](@ref)).

src/modifiers.jl (new executable file, 121 lines)
@@ -0,0 +1,121 @@
"""
Partial(dist)
Creates the `Partial{dist}` distance.
`Partial{dist}` returns the minimum distance between the shorter string and substrings of the longer string (of the size of the shorter string).
See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
0.5483870967741935
```
"""
struct Partial{S <: SemiMetric} <: SemiMetric
dist::S
end
function (dist::Partial)(s1, s2)
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
out = dist.dist(s1, s2)
((len1 == 0) | (len1 == len2)) && return out
for x in qgrams(s2, len1)
curr = dist.dist(s1, x)
out = min(out, curr)
end
return out
end
function (dist::Partial{RatcliffObershelp})(s1, s2)
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 == len2 && return dist.dist(s1, s2)
out = 1.0
for r in matching_blocks(s1, s2)
# Make sure the substring of s2 has length len1
s2_start = r[2] - r[1] + 1
s2_end = s2_start + len1 - 1
if s2_start < 1
s2_end += 1 - s2_start
s2_start += 1 - s2_start
elseif s2_end > len2
s2_start += len2 - s2_end
s2_end += len2 - s2_end
end
curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
out = min(out, curr)
end
return out
end
"""
TokenSort(dist)
Creates the `TokenSort{dist}` distance.
`TokenSort{dist}` returns the distance between strings after reordering words alphabetically.
See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
0.0
```
"""
struct TokenSort{S <: SemiMetric} <: SemiMetric
dist::S
end
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
function (dist::TokenSort)(s1::AbstractString, s2::AbstractString)
s1 = join(sort!(split(s1)), " ")
s2 = join(sort!(split(s2)), " ")
out = dist.dist(s1, s2)
end
"""
TokenSet(dist)
Creates the `TokenSet{dist}` distance.
`TokenSet{dist}` compares the intersection of two strings with each string, after reordering words alphabetically.
See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
0.0
```
"""
struct TokenSet{S <: SemiMetric} <: SemiMetric
dist::S
end
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
function (dist::TokenSet)(s1::AbstractString, s2::AbstractString)
v1 = unique!(sort!(split(s1)))
v2 = unique!(sort!(split(s2)))
v0 = intersect(v1, v2)
s0 = join(v0, " ")
s1 = join(v1, " ")
s2 = join(v2, " ")
isempty(s0) && return dist.dist(s1, s2)
score_01 = dist.dist(s0, s1)
score_02 = dist.dist(s0, s2)
score_12 = dist.dist(s1, s2)
min(score_01, score_02, score_12)
end

@@ -1,41 +1,34 @@
struct Normalize{S <: SemiMetric} <: SemiMetric
dist::S
struct Normalized{V <: SemiMetric} <: SemiMetric
dist::V
max_dist::Float64
end
"""
normalize(dist::SemiMetric)
Normalize a metric, so that `evaluate` always returns a Float64 between 0 and 1
"""
normalize(dist::SemiMetric, max_dist = 1.0) = Normalize(dist)
normalize(dist::Normalize, max_dist = 1.0) = Normalize(dist.dist)
function (dist::Normalize{<:Hamming})(s1, s2, max_dist = 1.0)
function (dist::Normalized{<:Hamming})(s1, s2)
((s1 === missing) | (s2 === missing)) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len2 == 0 && return 1.0
out = dist.dist(s1, s2) / len2
out > max_dist ? 1.0 : out
out > dist.max_dist ? 1.0 : out
end
# A normalized distance is between 0 and 1, and accepts a third argument, max_dist.
function (dist::Normalize{<: Union{Levenshtein, DamerauLevenshtein}})(s1, s2, max_dist = 1.0)
function (dist::Normalized{<:Union{Levenshtein{Nothing}, DamerauLevenshtein{Nothing}}})(s1, s2)
((s1 === missing) | (s2 === missing)) && return missing
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len2 == 0 && return 1.0
if dist.dist isa Levenshtein
d = Levenshtein(ceil(Int, len2 * max_dist))(s1, s2)
d = Levenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
else
d = DamerauLevenshtein(ceil(Int, len2 * max_dist))(s1, s2)
d = DamerauLevenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
end
out = d / len2
out > max_dist ? 1.0 : out
out > dist.max_dist ? 1.0 : out
end
function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
function (dist::Normalized{<:QGramDistance})(s1, s2)
((s1 === missing) | (s2 === missing)) && return missing
# When string length < q for qgram distance, returns s1 == s2
s1, s2 = reorder(s1, s2)
@@ -46,143 +39,22 @@ function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
else
out = dist.dist(s1, s2)
end
out > max_dist ? 1.0 : out
out > dist.max_dist ? 1.0 : out
end
function (dist::Normalize)(s1, s2, max_dist = 1.0)
function (dist::Normalized)(s1, s2)
out = dist.dist(s1, s2)
out > max_dist ? 1.0 : out
out > dist.max_dist ? 1.0 : out
end
"""
Partial(dist)
Creates the `Partial{dist}` distance.
`Partial{dist}` normalizes the string distance `dist` and modifies it to return the
minimum distance between the shorter string and substrings of the longer string.
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
0.5483870967741935
```
"""
struct Partial{S <: SemiMetric} <: SemiMetric
dist::S
Partial{S}(dist::S) where {S <: SemiMetric} = new(dist)
end
Partial(dist::SemiMetric) = Partial{typeof(normalize(dist))}(normalize(dist))
normalize(dist::Partial) = dist
function (dist::Partial)(s1, s2, max_dist = 1.0)
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
out = dist.dist(s1, s2, max_dist)
len1 == len2 && return out
len1 == 0 && return out
for x in qgrams(s2, len1)
curr = dist.dist(s1, x, max_dist)
out = min(out, curr)
max_dist = min(out, max_dist)
end
return out
end
function (dist::Partial{Normalize{RatcliffObershelp}})(s1, s2, max_dist = 1.0)
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
len1 == len2 && return dist.dist(s1, s2)
out = 1.0
for r in matching_blocks(s1, s2)
# Make sure the substring of s2 has length len1
s2_start = r[2] - r[1] + 1
s2_end = s2_start + len1 - 1
if s2_start < 1
s2_end += 1 - s2_start
s2_start += 1 - s2_start
elseif s2_end > len2
s2_start += len2 - s2_end
s2_end += len2 - s2_end
end
curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
out = min(out, curr)
end
return out
end
"""
TokenSort(dist)
Creates the `TokenSort{dist}` distance.
`TokenSort{dist}` normalizes the string distance `dist` and modifies it to adjust for differences
in word orders by reordering words alphabetically.
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s1 = "New York Mets vs Atlanta Braves"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
0.0
```
"""
struct TokenSort{S <: SemiMetric} <: SemiMetric
dist::S
TokenSort{S}(dist::S) where {S <: SemiMetric} = new(dist)
end
TokenSort(dist::SemiMetric) = TokenSort{typeof(normalize(dist))}(normalize(dist))
normalize(dist::TokenSort) = dist
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
function (dist::TokenSort)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
s1 = join(sort!(split(s1)), " ")
s2 = join(sort!(split(s2)), " ")
out = dist.dist(s1, s2, max_dist)
end
"""
TokenSet(dist)
Creates the `TokenSet{dist}` distance.
`TokenSet{dist}` normalizes the string distance `dist` and modifies it to adjust for differences
in word orders and word numbers by comparing the intersection of two strings with each string.
### Examples
```julia-repl
julia> s1 = "New York Mets vs Atlanta"
julia> s2 = "Atlanta Braves vs New York Mets"
julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
0.0
```
"""
struct TokenSet{S <: SemiMetric} <: SemiMetric
dist::S
TokenSet{S}(dist::S) where {S <: SemiMetric} = new(dist)
end
TokenSet(dist::SemiMetric) = TokenSet{typeof(normalize(dist))}(normalize(dist))
normalize(dist::TokenSet) = dist
# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
function (dist::TokenSet)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
v1 = unique!(sort!(split(s1)))
v2 = unique!(sort!(split(s2)))
v0 = intersect(v1, v2)
s0 = join(v0, " ")
s1 = join(v1, " ")
s2 = join(v2, " ")
isempty(s0) && return dist.dist(s1, s2, max_dist)
score_01 = dist.dist(s0, s1, max_dist)
max_dist = min(max_dist, score_01)
score_02 = dist.dist(s0, s2, max_dist)
max_dist = min(max_dist, score_02)
score_12 = dist.dist(s1, s2, max_dist)
min(score_01, score_02, score_12)
end
normalize(dist::SemiMetric; max_dist = 1.0) = Normalized{typeof(dist)}(dist, max_dist)
normalize(dist::Union{Jaro, JaroWinkler}; max_dist = 1.0) = dist
normalize(dist::Partial; max_dist = 1.0) = Partial(normalize(dist.dist; max_dist = max_dist))
normalize(dist::TokenSort; max_dist = 1.0) = TokenSort(normalize(dist.dist; max_dist = max_dist))
normalize(dist::TokenSet; max_dist = 1.0) = TokenSet(normalize(dist.dist; max_dist = max_dist))
normalize(dist::Normalized; max_dist = 1.0) = Normalized{typeof(dist.dist)}(dist.dist, max_dist)
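# Illustration (value taken from this commit's tests): each wrapper is rebuilt
# around a Normalized core, so
#   normalize(Partial(DamerauLevenshtein()))("ab", "cde") == 1.0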
"""
TokenMax(dist)
@@ -207,69 +79,38 @@ struct TokenMax{S <: SemiMetric} <: SemiMetric
end
TokenMax(dist::SemiMetric) = TokenMax{typeof(normalize(dist))}(normalize(dist))
normalize(dist::TokenMax) = dist
function normalize(dist::TokenMax; max_dist = 1.0)
dist = normalize(dist.dist; max_dist = max_dist)
TokenMax{typeof(dist)}(dist)
end
function (dist::TokenMax)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
function (dist::TokenMax)(s1::AbstractString, s2::AbstractString)
s1, s2 = reorder(s1, s2)
len1, len2 = length(s1), length(s2)
score = dist.dist(s1, s2, max_dist)
_dist = deepcopy(dist.dist)
max_dist = _dist.max_dist
score = _dist(s1, s2)
min_score = min(max_dist, score)
unbase_scale = 0.95
# if one string is much shorter than the other, use partial
if length(s2) >= 1.5 * length(s1)
partial_dist = Partial(dist.dist)
partial_scale = length(s2) > (8 * length(s1)) ? 0.6 : 0.9
score_partial = 1 - partial_scale * (1 - partial_dist(s1, s2, 1 - (1 - max_dist) / partial_scale))
_dist = Normalized(_dist.dist, 1 - (1 - max_dist) / partial_scale)
score_partial = 1 - partial_scale * (1 - Partial(_dist)(s1, s2))
min_score = min(max_dist, score_partial)
score_sort = 1 - unbase_scale * partial_scale *
(1 - TokenSort(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
_dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
score_sort = 1 - unbase_scale * partial_scale * (1 - TokenSort(Partial(_dist))(s1, s2))
max_dist = min(max_dist, score_sort)
score_set = 1 - unbase_scale * partial_scale *
(1 - TokenSet(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
_dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
score_set = 1 - unbase_scale * partial_scale * (1 - TokenSet(Partial(_dist))(s1, s2))
out = min(score, score_partial, score_sort, score_set)
else
score_sort = 1 - unbase_scale *
(1 - TokenSort(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
_dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
score_sort = 1 - unbase_scale * (1 - TokenSort(_dist)(s1, s2))
max_dist = min(max_dist, score_sort)
score_set = 1 - unbase_scale *
(1 - TokenSet(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
_dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
score_set = 1 - unbase_scale * (1 - TokenSet(_dist)(s1, s2))
out = min(score, score_sort, score_set)
end
out > max_dist ? 1.0 : out
end
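# Hedged sketch (illustrative inputs, no asserted output): TokenMax is meant to
# be used through compare on multi-word strings, e.g.
#   compare("1 main street", "main street 1, apt 2", TokenMax(Levenshtein()))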
"""
Winkler(dist; p::Real = 0.1, threshold::Real = 0.7, maxlength::Integer = 4)
Creates the `Winkler{dist, p, threshold, maxlength}` distance.
`Winkler{dist, p, threshold, maxlength}` normalizes the string distance `dist` and modifies it to decrease the
distance between two strings, when their original distance is below some `threshold`.
The boost is equal to `min(l, maxlength) * p * dist` where `l` denotes the
length of their common prefix and `dist` denotes the original distance
"""
struct Winkler{S <: SemiMetric} <: SemiMetric
dist::S
p::Float64 # scaling factor. Default to 0.1
threshold::Float64 # boost threshold. Default to 0.7
maxlength::Integer # max length of common prefix. Default to 4
Winkler{S}(dist::S, p, threshold, maxlength) where {S <: SemiMetric} = new(dist, p, threshold, maxlength)
end
function Winkler(dist::SemiMetric; p = 0.1, threshold = 0.7, maxlength = 4)
p * maxlength <= 1 || throw("scaling factor times maxlength of common prefix must be lower than one")
dist = normalize(dist)
Winkler{typeof(dist)}(dist, 0.1, 0.7, 4)
end
normalize(dist::Winkler) = dist
function (dist::Winkler)(s1, s2, max_dist = 1.0)
# cannot do max_dist because of boosting threshold
out = dist.dist(s1, s2)
if out <= 1 - dist.threshold
l = common_prefix(s1, s2)[1]
out -= min(l, dist.maxlength) * dist.p * out
end
out > max_dist ? 1.0 : out
end

@@ -74,7 +74,7 @@ function Distances.pairwise!(R::AbstractMatrix, dist::StringDistance, xs::Abstra
end
function _preprocess(xs, dist::QGramDistance, preprocess)
if (preprocess === true) || (isnothing(preprocess) && length(xs) >= 5)
if preprocess === nothing ? length(xs) >= 5 : preprocess
return map(x -> x === missing ? x : QGramSortedVector(x, dist.q), xs)
else
return xs

@@ -26,13 +26,13 @@ using StringDistances, Unicode, Test
@test compare("ab", "de", Partial(DamerauLevenshtein())) == 0
@test normalize(Partial(DamerauLevenshtein()))("ab", "cde") == 1.0
# Winkler
@test compare("martha", "marhta", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.9611 atol = 1e-4
@test compare("dwayne", "duane", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.84 atol = 1e-4
@test compare("dixon", "dicksonx", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.81333 atol = 1e-4
@test compare("william", "williams", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.975 atol = 1e-4
@test compare("", "foo", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.0 atol = 1e-4
@test compare("a", "a", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 1.0 atol = 1e-4
@test compare("abc", "xyz", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) 0.0 atol = 1e-4
@test compare("martha", "marhta", JaroWinkler()) 0.9611 atol = 1e-4
@test compare("dwayne", "duane", JaroWinkler()) 0.84 atol = 1e-4
@test compare("dixon", "dicksonx", JaroWinkler()) 0.81333 atol = 1e-4
@test compare("william", "williams", JaroWinkler()) 0.975 atol = 1e-4
@test compare("", "foo", JaroWinkler()) 0.0 atol = 1e-4
@test compare("a", "a", JaroWinkler()) 1.0 atol = 1e-4
@test compare("abc", "xyz", JaroWinkler()) 0.0 atol = 1e-4
# RatcliffObershelp
@test compare("New York Mets vs Atlanta Braves", "", RatcliffObershelp()) 0.0
@@ -104,9 +104,9 @@ using StringDistances, Unicode, Test
@test findnearest("New York", ["San Francisco", "NewYork", "Newark"], Levenshtein()) == ("NewYork", 2)
@test findnearest("New York", ["Newark", "San Francisco", "NewYork"], Levenshtein()) == ("NewYork", 3)
@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.99) == (nothing, nothing)
@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Jaro()) == ("NewYork", 1)
@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], QGram(2)) == ("NewYork", 1)
@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], normalize(QGram(2))) == ("NewYork", 1)
@test findall("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein()) == [1]