redefine modifiers
parent bd9c7fba24
commit 730a513d8e

README.md (34 changed lines)
@@ -12,7 +12,7 @@ The available distances are:
 - Edit Distances
   - Hamming Distance `Hamming()`
-  - [Jaro Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()`
+  - [Jaro and Jaro-Winkler Distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) `Jaro()` `JaroWinkler()`
   - [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) `Levenshtein()`
   - [Damerau-Levenshtein Distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) `DamerauLevenshtein()`
   - [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) `RatcliffObershelp()`
@@ -24,13 +24,13 @@ The available distances are:
   - [Sorensen-Dice Distance](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) `SorensenDice(q::Int)`
   - [MorisitaOverlap Distance](https://en.wikipedia.org/wiki/Morisita%27s_overlap_index) `MorisitaOverlap(q::Int)`
   - [Normalized Multiset Distance](https://www.sciencedirect.com/science/article/pii/S1047320313001417) `NMD(q::Int)`
-- Distance "modifiers" that can be applied to any distance:
-  - [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the normalized distance between the shorter string and substrings of the longer string.
-  - [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders by returning the normalized distance of the two strings, after re-ordering words alphabetically.
-  - [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word orders and word numbers by returning the normalized distance between the intersection of two strings with each string.
-  - [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) combines the normalized distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
-  - [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) diminishes the normalized distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
+The package also defines distance "modifiers" that can be applied to any distance:
+- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) returns the minimum of the distance between the shorter string and substrings of the longer string.
+- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by returning the distance of the two strings after re-ordering their words alphabetically.
+- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count by returning the distance between the intersection of the two strings and each string.
+- [TokenMax](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) normalizes the distance and combines the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/).
 
 ## Basic Use
 
 ### evaluate
@@ -49,36 +49,36 @@ Levenshtein()("martha", "marhta")
 ```
 
 ### pairwise
-`pairwise` returns the matrix of distance between two `AbstractVectors`
+`pairwise` returns the matrix of distances between two `AbstractVector`s of `AbstractString`s
 
 ```julia
 pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
 ```
 It is particularly fast for QGram-distances (each element is processed once).
 
-### compare
-The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0.0 and 1.0: a value of 0 means completely different and a value of 1 means completely similar.
+### compare and find
+The function `compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64`: a value of 0.0 means completely different and a value of 1.0 means completely similar.
 
 ```julia
-evaluate(Levenshtein(), "martha", "martha")
+Levenshtein()("martha", "martha")
 #> 0.0
 compare("martha", "martha", Levenshtein())
 #> 1.0
 ```
 
-### find
-- `findnearest` returns the value and index of the element in `itr` with the lowest distance with `s`. Its syntax is:
+`findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
 ```julia
-findnearest(s, itr, dist::StringDistance; min_score = 0.0)
+findnearest(s, itr, dist::StringDistance)
 ```
 
-- `findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (default to 0.8). Its syntax is:
+`findall` returns the indices of all elements in `itr` whose similarity score with `s` is higher than a minimum value (defaults to 0.8). Its syntax is:
 ```julia
 findall(s, itr, dist::StringDistance; min_score = 0.8)
 ```
 
 The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).
 
 ## References
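For readers following along without the package installed, the `compare` convention in the README above (one minus the distance, normalized by the longer string's length) can be sketched in plain Julia. `toy_levenshtein` and `toy_compare` are illustrative stand-ins, not the package's optimized implementations:

```julia
# Textbook dynamic-programming Levenshtein distance (illustration only).
function toy_levenshtein(s1, s2)
    a, b = collect(s1), collect(s2)
    m, n = length(a), length(b)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] = 0:m                      # deleting i characters costs i
    d[1, :] = 0:n                      # inserting j characters costs j
    for i in 1:m, j in 1:n
        cost = a[i] == b[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1, d[i + 1, j] + 1, d[i, j] + cost)
    end
    return d[m + 1, n + 1]
end

# compare-style score: 1 minus the distance normalized by the longer length.
toy_compare(s1, s2) = 1 - toy_levenshtein(s1, s2) / max(length(s1), length(s2), 1)

toy_levenshtein("martha", "marhta")  # 2 (two substitutions)
toy_compare("martha", "marhta")      # ≈ 0.6667, the value quoted in the compare docstring
```
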
@@ -5,16 +5,17 @@ using Distances
 include("distances/utils.jl")
 include("distances/edit.jl")
 include("distances/qgram.jl")
+include("modifiers.jl")
 include("normalize.jl")
 
-const StringDistance = Union{Hamming, Jaro, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Winkler, Partial, TokenSort, TokenSet, TokenMax, Normalize}
 include("find.jl")
 include("pairwise.jl")
+
+# Distances API
+Distances.result_type(dist::StringDistance, s1::Type, s2::Type) = typeof(dist("", ""))
+Distances.result_type(dist::StringDistance, s1, s2) = result_type(dist, typeof(s1), typeof(s2))
 
 ##############################################################################
 ##
@@ -28,18 +29,18 @@ Hamming,
 Levenshtein,
 DamerauLevenshtein,
 Jaro,
+JaroWinkler,
 RatcliffObershelp,
 QGramDistance,
 QGram,
+QGramDict,
+QGramSortedVector,
 Cosine,
 Jaccard,
 SorensenDice,
 Overlap,
 MorisitaOverlap,
 NMD,
-QGramDict,
-QGramSortedVector,
 Winkler,
 Partial,
 TokenSort,
 TokenSet,
@@ -12,13 +12,13 @@ Hamming() = Hamming(nothing)
 
 function (dist::Hamming)(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
-    current = abs(length(s2) - length(s1))
-    dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
+    out = abs(length(s2) - length(s1))
+    dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
     for (ch1, ch2) in zip(s1, s2)
-        current += ch1 != ch2
-        dist.max_dist !== nothing && current > dist.max_dist && return dist.max_dist + 1
+        out += ch1 != ch2
+        dist.max_dist !== nothing && out > dist.max_dist && return dist.max_dist + 1
    end
-    return current
+    return out
 end
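The `current` → `out` rename in the hunk above leaves the early-exit logic unchanged; that logic can be sketched standalone (`capped_hamming` is a hypothetical free function mirroring the method, not the package's `Hamming`):

```julia
# Mismatch count with a length-difference penalty and an optional cap:
# once the running count exceeds max_dist, bail out with max_dist + 1.
function capped_hamming(s1, s2; max_dist = nothing)
    out = abs(length(s2) - length(s1))
    max_dist !== nothing && out > max_dist && return max_dist + 1
    for (ch1, ch2) in zip(s1, s2)
        out += ch1 != ch2
        max_dist !== nothing && out > max_dist && return max_dist + 1
    end
    return out
end

capped_hamming("martha", "marhta")                 # 2
capped_hamming("abcdef", "uvwxyz", max_dist = 2)   # 3 (= max_dist + 1, early exit)
```

The early exit is what makes the capped version cheap on long, clearly different strings: the loop stops as soon as the budget is exhausted.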
@@ -73,6 +73,37 @@ function (dist::Jaro)(s1, s2)
     return 1.0 - (m / len1 + m / len2 + (m - t/2) / m) / 3.0
 end
 
+"""
+    JaroWinkler(; p = 0.1, threshold = 0.3, maxlength = 4)
+
+Creates the JaroWinkler distance.
+
+The JaroWinkler distance is defined as the Jaro distance multiplied by
+``(1 - min(l, maxlength) * p)`` as long as it is lower than `threshold`, where `l` denotes the length of the common prefix.
+"""
+struct JaroWinkler <: SemiMetric
+    p::Float64          # scaling factor. Defaults to 0.1
+    threshold::Float64  # boost limit. Defaults to 0.3
+    maxlength::Integer  # max length of common prefix. Defaults to 4
+end
+
+JaroWinkler(; p = 0.1, threshold = 0.3, maxlength = 4) = JaroWinkler(p, threshold, maxlength)
+
+## http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html
+function (dist::JaroWinkler)(s1, s2)
+    ((s1 === missing) | (s2 === missing)) && return missing
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    out = Jaro()(s1, s2)
+    if out <= dist.threshold
+        l = common_prefix(s1, s2)[1]
+        out = (1 - min(l, dist.maxlength) * dist.p) * out
+    end
+    return out
+end
+
 """
     Levenshtein()
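The prefix adjustment in the new `JaroWinkler` method is plain arithmetic and can be checked in isolation (`winkler_adjust` is a hypothetical helper for illustration, not part of the package):

```julia
# A Jaro-style distance d is multiplied by (1 - min(l, maxlength) * p)
# when it is already small enough (d <= threshold); l = common-prefix length.
function winkler_adjust(d; l = 0, p = 0.1, threshold = 0.3, maxlength = 4)
    d <= threshold ? (1 - min(l, maxlength) * p) * d : d
end

winkler_adjust(0.2, l = 3)  # ≈ 0.14: shrunk thanks to the shared prefix
winkler_adjust(0.5, l = 3)  # 0.5: above the threshold, left unchanged
```

Note the asymmetry: only already-close strings get the prefix bonus, which is why the method tests `out <= dist.threshold` before touching the score.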
src/find.jl (13 changed lines)

@@ -1,3 +1,5 @@
+const StringDistance = Union{Hamming, Jaro, JaroWinkler, Levenshtein, DamerauLevenshtein, RatcliffObershelp, QGramDistance, Partial, TokenSort, TokenSet, TokenMax, Normalized}
+
 """
     compare(s1, s2, dist)
 
@@ -10,16 +12,15 @@ julia> compare("martha", "marhta", Levenshtein())
 0.6666666666666667
 ```
 """
-compare(s1, s2, dist::StringDistance; min_score = 0.0) = 1 - normalize(dist)(s1, s2, 1 - min_score)
+function compare(s1, s2, dist::StringDistance; min_score = 0.0)
+    1 - normalize(dist, max_dist = 1 - min_score)(s1, s2)
+end
 
 """
-    findnearest(s, itr, dist::StringDistance; min_score = 0.0) -> (x, index)
+    findnearest(s, itr, dist::StringDistance) -> (x, index)
 
 `findnearest` returns the value and index of the element of `itr` that has the
-highest similarity score with `s` according to the distance `dist`.
-It returns `(nothing, nothing)` if none of the elements has a similarity score
-higher or equal to `min_score` (default to 0.0).
+lowest distance with `s` according to the distance `dist`.
 
 It is particularly optimized for [`Levenshtein`](@ref) and [`DamerauLevenshtein`](@ref) distances
 (as well as their modifications via [`Partial`](@ref), [`TokenSort`](@ref), [`TokenSet`](@ref), or [`TokenMax`](@ref)).
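The docstring's contract (best element plus its index, `(nothing, nothing)` when nothing qualifies) can be sketched with a toy scorer; `toy_findnearest` and `toy_score` are hypothetical illustrations, not the package's optimized implementation:

```julia
# Linear scan keeping the best-scoring element and its index.
function toy_findnearest(score, s, itr)
    best, besti, bestscore = nothing, nothing, -Inf
    for (i, x) in enumerate(itr)
        sc = score(s, x)
        if sc > bestscore
            best, besti, bestscore = x, i, sc
        end
    end
    return best, besti  # (nothing, nothing) if itr was empty
end

# Toy similarity: number of distinct shared characters.
toy_score(a, b) = length(intersect(Set(a), Set(b)))

toy_findnearest(toy_score, "New York", ["NewYork", "Newark", "San Francisco"])
# ("NewYork", 1)
```

The package versions avoid this naive full scan for `Levenshtein`-family distances by pruning with the best score found so far.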
src/modifiers.jl (new file)

@@ -0,0 +1,121 @@
+"""
+    Partial(dist)
+
+Creates the `Partial{dist}` distance.
+
+`Partial{dist}` returns the minimum distance between the shorter string and substrings of the longer string (of the size of the shorter string).
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta Braves"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
+0.5483870967741935
+```
+"""
+struct Partial{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+function (dist::Partial)(s1, s2)
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    out = dist.dist(s1, s2)
+    ((len1 == 0) | (len1 == len2)) && return out
+    for x in qgrams(s2, len1)
+        curr = dist.dist(s1, x)
+        out = min(out, curr)
+    end
+    return out
+end
+
+function (dist::Partial{RatcliffObershelp})(s1, s2)
+    s1, s2 = reorder(s1, s2)
+    len1, len2 = length(s1), length(s2)
+    len1 == len2 && return dist.dist(s1, s2)
+    out = 1.0
+    for r in matching_blocks(s1, s2)
+        # Make sure the substring of s2 has length len1
+        s2_start = r[2] - r[1] + 1
+        s2_end = s2_start + len1 - 1
+        if s2_start < 1
+            s2_end += 1 - s2_start
+            s2_start += 1 - s2_start
+        elseif s2_end > len2
+            s2_start += len2 - s2_end
+            s2_end += len2 - s2_end
+        end
+        curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
+        out = min(out, curr)
+    end
+    return out
+end
+
+"""
+    TokenSort(dist)
+
+Creates the `TokenSort{dist}` distance.
+
+`TokenSort{dist}` returns the distance between strings after re-ordering their words alphabetically.
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta Braves"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
+0.0
+```
+"""
+struct TokenSort{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+function (dist::TokenSort)(s1::AbstractString, s2::AbstractString)
+    s1 = join(sort!(split(s1)), " ")
+    s2 = join(sort!(split(s2)), " ")
+    out = dist.dist(s1, s2)
+end
+
+"""
+    TokenSet(dist)
+
+Creates the `TokenSet{dist}` distance.
+
+`TokenSet{dist}` compares the intersection of two strings with each string, after re-ordering words alphabetically.
+
+See: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+
+### Examples
+```julia-repl
+julia> s1 = "New York Mets vs Atlanta"
+julia> s2 = "Atlanta Braves vs New York Mets"
+julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
+0.0
+```
+"""
+struct TokenSet{S <: SemiMetric} <: SemiMetric
+    dist::S
+end
+
+# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
+function (dist::TokenSet)(s1::AbstractString, s2::AbstractString)
+    v1 = unique!(sort!(split(s1)))
+    v2 = unique!(sort!(split(s2)))
+    v0 = intersect(v1, v2)
+    s0 = join(v0, " ")
+    s1 = join(v1, " ")
+    s2 = join(v2, " ")
+    isempty(s0) && return dist.dist(s1, s2)
+    score_01 = dist.dist(s0, s1)
+    score_02 = dist.dist(s0, s2)
+    score_12 = dist.dist(s1, s2)
+    min(score_01, score_02, score_12)
+end
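The word-level transformations these modifiers apply are easy to see in isolation; `token_sorted` and `token_set` below are hypothetical helpers mirroring the bodies above, on which the wrapped distance is then evaluated:

```julia
# TokenSort's preprocessing: words sorted alphabetically, rejoined.
token_sorted(s) = join(sort!(split(s)), " ")

# TokenSet's preprocessing: sorted, deduplicated word vectors, whose
# intersection is then compared against each full string.
token_set(s) = unique!(sort!(split(s)))

token_sorted("New York Mets vs Atlanta Braves")
# "Atlanta Braves Mets New York vs"
```

Both example strings from the docstrings reduce to the same sorted form, which is why `TokenSort(RatcliffObershelp())` reports a distance of 0.0 for them.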
src/normalize.jl (233 changed lines)

@@ -1,41 +1,34 @@
-struct Normalize{S <: SemiMetric} <: SemiMetric
-    dist::S
+struct Normalized{V <: SemiMetric} <: SemiMetric
+    dist::V
+    max_dist::Float64
 end
 
-"""
-    normalize(dist::SemiMetric)
-
-Normalize a metric, so that `evaluate` always return a Float64 between 0 and 1
-"""
-normalize(dist::SemiMetric, max_dist = 1.0) = Normalize(dist)
-normalize(dist::Normalize, max_dist = 1.0) = Normalize(dist.dist)
-
-function (dist::Normalize{<:Hamming})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:Hamming})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
     len2 == 0 && return 1.0
     out = dist.dist(s1, s2) / len2
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-# A normalized distance is between 0 and 1, and accept a third argument, max_dist.
-function (dist::Normalize{<: Union{Levenshtein, DamerauLevenshtein}})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:Union{Levenshtein{Nothing}, DamerauLevenshtein{Nothing}}})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
     len2 == 0 && return 1.0
     if dist.dist isa Levenshtein
-        d = Levenshtein(ceil(Int, len2 * max_dist))(s1, s2)
+        d = Levenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
     else
-        d = DamerauLevenshtein(ceil(Int, len2 * max_dist))(s1, s2)
+        d = DamerauLevenshtein(ceil(Int, len2 * dist.max_dist))(s1, s2)
    end
    out = d / len2
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
+function (dist::Normalized{<:QGramDistance})(s1, s2)
     ((s1 === missing) | (s2 === missing)) && return missing
     # When string length < q for qgram distance, returns s1 == s2
     s1, s2 = reorder(s1, s2)
@@ -46,143 +39,22 @@ function (dist::Normalize{<: QGramDistance})(s1, s2, max_dist = 1.0)
     else
         out = dist.dist(s1, s2)
     end
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
 
-function (dist::Normalize)(s1, s2, max_dist = 1.0)
+function (dist::Normalized)(s1, s2)
     out = dist.dist(s1, s2)
-    out > max_dist ? 1.0 : out
+    out > dist.max_dist ? 1.0 : out
 end
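The clamping rule these `Normalized` methods share is that any normalized distance beyond `max_dist` collapses to 1.0; `compare` in find.jl exploits it by setting `max_dist = 1 - min_score`. A standalone sketch (`toy_compare_with_cutoff` is a hypothetical helper operating on an already-normalized distance):

```julia
# A normalized distance above max_dist = 1 - min_score is clamped to 1.0,
# so the resulting similarity score bottoms out at 0.0 below min_score.
toy_compare_with_cutoff(ndist, min_score) = 1 - (ndist > 1 - min_score ? 1.0 : ndist)

toy_compare_with_cutoff(0.2, 0.5)  # 0.8: distance within the cap, exact score
toy_compare_with_cutoff(0.7, 0.5)  # 0.0: distance beyond the cap, clamped
```

The clamp is what lets the `Levenshtein{Nothing}` method above cap its edit budget at `ceil(Int, len2 * dist.max_dist)` without changing any reported score above the cutoff.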
-"""
-    Partial(dist)
-
-Creates the `Partial{dist}` distance.
-
-`Partial{dist}` normalizes the string distance `dist` and modify it to return the
-minimum distance between the shorter string and substrings of the longer string
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta Braves"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(Partial(RatcliffObershelp()), s1, s2)
-0.5483870967741935
-```
-"""
-struct Partial{S <: SemiMetric} <: SemiMetric
-    dist::S
-    Partial{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-Partial(dist::SemiMetric) = Partial{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::Partial) = dist
-
-function (dist::Partial)(s1, s2, max_dist = 1.0)
-    s1, s2 = reorder(s1, s2)
-    len1, len2 = length(s1), length(s2)
-    out = dist.dist(s1, s2, max_dist)
-    len1 == len2 && return out
-    len1 == 0 && return out
-    for x in qgrams(s2, len1)
-        curr = dist.dist(s1, x, max_dist)
-        out = min(out, curr)
-        max_dist = min(out, max_dist)
-    end
-    return out
-end
-
-function (dist::Partial{Normalize{RatcliffObershelp}})(s1, s2, max_dist = 1.0)
-    s1, s2 = reorder(s1, s2)
-    len1, len2 = length(s1), length(s2)
-    len1 == len2 && return dist.dist(s1, s2)
-    out = 1.0
-    for r in matching_blocks(s1, s2)
-        # Make sure the substring of s2 has length len1
-        s2_start = r[2] - r[1] + 1
-        s2_end = s2_start + len1 - 1
-        if s2_start < 1
-            s2_end += 1 - s2_start
-            s2_start += 1 - s2_start
-        elseif s2_end > len2
-            s2_start += len2 - s2_end
-            s2_end += len2 - s2_end
-        end
-        curr = dist.dist(s1, _slice(s2, s2_start - 1, s2_end))
-        out = min(out, curr)
-    end
-    return out
-end
-
-"""
-    TokenSort(dist)
-
-Creates the `TokenSort{dist}` distance.
-
-`TokenSort{dist}` normalizes the string distance `dist` and modify it to adjust for differences
-in word orders by reording words alphabetically.
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta Braves"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(TokenSort(RatcliffObershelp()), s1, s2)
-0.0
-```
-"""
-struct TokenSort{S <: SemiMetric} <: SemiMetric
-    dist::S
-    TokenSort{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-TokenSort(dist::SemiMetric) = TokenSort{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenSort) = dist
-
-# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-function (dist::TokenSort)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
-    s1 = join(sort!(split(s1)), " ")
-    s2 = join(sort!(split(s2)), " ")
-    out = dist.dist(s1, s2, max_dist)
-end
-
-"""
-    TokenSet(dist)
-
-Creates the `TokenSet{dist}` distance.
-
-`TokenSet{dist}` normalizes the string distance `dist` and modify it to adjust for differences
-in word orders and word numbers by comparing the intersection of two strings with each string.
-
-### Examples
-```julia-repl
-julia> s1 = "New York Mets vs Atlanta"
-julia> s2 = "Atlanta Braves vs New York Mets"
-julia> evaluate(TokenSet(RatcliffObershelp()), s1, s2)
-0.0
-```
-"""
-struct TokenSet{S <: SemiMetric} <: SemiMetric
-    dist::S
-    TokenSet{S}(dist::S) where {S <: SemiMetric} = new(dist)
-end
-TokenSet(dist::SemiMetric) = TokenSet{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenSet) = dist
-
-# http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-function (dist::TokenSet)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
-    v1 = unique!(sort!(split(s1)))
-    v2 = unique!(sort!(split(s2)))
-    v0 = intersect(v1, v2)
-    s0 = join(v0, " ")
-    s1 = join(v1, " ")
-    s2 = join(v2, " ")
-    isempty(s0) && return dist.dist(s1, s2, max_dist)
-    score_01 = dist.dist(s0, s1, max_dist)
-    max_dist = min(max_dist, score_01)
-    score_02 = dist.dist(s0, s2, max_dist)
-    max_dist = min(max_dist, score_02)
-    score_12 = dist.dist(s1, s2, max_dist)
-    min(score_01, score_02, score_12)
-end
+normalize(dist::SemiMetric; max_dist = 1.0) = Normalized{typeof(dist)}(dist, max_dist)
+normalize(dist::Union{Jaro, JaroWinkler}; max_dist = 1.0) = dist
+normalize(dist::Partial; max_dist = 1.0) = Partial(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::TokenSort; max_dist = 1.0) = TokenSort(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::TokenSet; max_dist = 1.0) = TokenSet(normalize(dist.dist; max_dist = max_dist))
+normalize(dist::Normalized; max_dist = 1.0) = Normalized{typeof(dist.dist)}(dist.dist, max_dist)
 
 """
     TokenMax(dist)
@@ -207,69 +79,38 @@ struct TokenMax{S <: SemiMetric} <: SemiMetric
 end
 
 TokenMax(dist::SemiMetric) = TokenMax{typeof(normalize(dist))}(normalize(dist))
-normalize(dist::TokenMax) = dist
+function normalize(dist::TokenMax; max_dist = 1.0)
+    dist = normalize(dist.dist; max_dist = max_dist)
+    TokenMax{typeof(dist)}(dist)
+end
 
-function (dist::TokenMax)(s1::AbstractString, s2::AbstractString, max_dist = 1.0)
+function (dist::TokenMax)(s1::AbstractString, s2::AbstractString)
     s1, s2 = reorder(s1, s2)
     len1, len2 = length(s1), length(s2)
-    score = dist.dist(s1, s2, max_dist)
+    _dist = deepcopy(dist.dist)
+    max_dist = _dist.max_dist
+    score = _dist(s1, s2)
     min_score = min(max_dist, score)
     unbase_scale = 0.95
     # if one string is much shorter than the other, use partial
     if length(s2) >= 1.5 * length(s1)
-        partial_dist = Partial(dist.dist)
         partial_scale = length(s2) > (8 * length(s1)) ? 0.6 : 0.9
-        score_partial = 1 - partial_scale * (1 - partial_dist(s1, s2, 1 - (1 - max_dist) / partial_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / partial_scale)
+        score_partial = 1 - partial_scale * (1 - Partial(_dist)(s1, s2))
        min_score = min(max_dist, score_partial)
-        score_sort = 1 - unbase_scale * partial_scale *
-            (1 - TokenSort(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
+        score_sort = 1 - unbase_scale * partial_scale * (1 - TokenSort(Partial(_dist))(s1, s2))
        max_dist = min(max_dist, score_sort)
-        score_set = 1 - unbase_scale * partial_scale *
-            (1 - TokenSet(partial_dist)(s1, s2, 1 - (1 - max_dist) / (unbase_scale * partial_scale)))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / (unbase_scale * partial_scale))
+        score_set = 1 - unbase_scale * partial_scale * (1 - TokenSet(Partial(_dist))(s1, s2))
        out = min(score, score_partial, score_sort, score_set)
    else
-        score_sort = 1 - unbase_scale *
-            (1 - TokenSort(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
+        score_sort = 1 - unbase_scale * (1 - TokenSort(_dist)(s1, s2))
        max_dist = min(max_dist, score_sort)
-        score_set = 1 - unbase_scale *
-            (1 - TokenSet(dist.dist)(s1, s2, 1 - (1 - max_dist) / unbase_scale))
+        _dist = Normalized(_dist.dist, 1 - (1 - max_dist) / unbase_scale)
+        score_set = 1 - unbase_scale * (1 - TokenSet(_dist)(s1, s2))
        out = min(score, score_sort, score_set)
    end
    out > max_dist ? 1.0 : out
 end
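The repeated `1 - scale * (1 - d)` pattern in `TokenMax` is a similarity-space damping: it converts the distance `d` to a similarity, shrinks that by `scale`, and converts back. A sketch (`rescale` is a hypothetical name for this pattern, used here only for illustration):

```julia
# Damp a normalized distance d in similarity space:
# the similarity (1 - d) is multiplied by scale, then mapped back to a distance.
rescale(d, scale) = 1 - scale * (1 - d)

rescale(0.0, 0.95)  # a perfect sub-match still incurs a small distance
rescale(1.0, 0.95)  # a complete mismatch stays at 1.0
```

This is why the `unbase_scale` and `partial_scale` factors act as penalty terms: a sub-score obtained via `TokenSort`/`TokenSet`/`Partial` can never beat a direct match of the same quality.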
 """
     Winkler(dist; p::Real = 0.1, threshold::Real = 0.7, maxlength::Integer = 4)
 
 Creates the `Winkler{dist, p, threshold, maxlength}` distance.
 
 `Winkler{dist, p, threshold, maxlength}` normalizes the string distance `dist` and modifies it to decrease the
 distance between two strings when their original distance is below some `threshold`.
 The boost is equal to `min(l, maxlength) * p * dist`, where `l` denotes the
 length of their common prefix and `dist` denotes the original distance.
 """
 struct Winkler{S <: SemiMetric} <: SemiMetric
     dist::S
     p::Float64          # scaling factor. Defaults to 0.1
     threshold::Float64  # boost threshold. Defaults to 0.7
     maxlength::Integer  # max length of common prefix. Defaults to 4
     Winkler{S}(dist::S, p, threshold, maxlength) where {S <: SemiMetric} = new(dist, p, threshold, maxlength)
 end
 
 function Winkler(dist::SemiMetric; p = 0.1, threshold = 0.7, maxlength = 4)
     p * maxlength <= 1 || throw("scaling factor times maxlength of common prefix must be lower than one")
     dist = normalize(dist)
     Winkler{typeof(dist)}(dist, p, threshold, maxlength)
 end
 normalize(dist::Winkler) = dist
 
 function (dist::Winkler)(s1, s2, max_dist = 1.0)
     # cannot do max_dist because of boosting threshold
     out = dist.dist(s1, s2)
     if out <= 1 - dist.threshold
         l = common_prefix(s1, s2)[1]
         out -= min(l, dist.maxlength) * dist.p * out
     end
     out > max_dist ? 1.0 : out
 end
@@ -74,7 +74,7 @@ function Distances.pairwise!(R::AbstractMatrix, dist::StringDistance, xs::Abstra
 end
 
 function _preprocess(xs, dist::QGramDistance, preprocess)
-    if (preprocess === true) || (isnothing(preprocess) && length(xs) >= 5)
+    if preprocess === nothing ? length(xs) >= 5 : preprocess
        return map(x -> x === missing ? x : QGramSortedVector(x, dist.q), xs)
    else
        return xs
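The rewritten condition reads as: `nothing` means "decide from the collection size", while an explicit `Bool` wins. Sketched with a hypothetical `should_preprocess` helper:

```julia
# nothing     -> heuristic: precompute q-gram counts only for >= 5 elements
# true/false  -> follow the caller's explicit choice
should_preprocess(preprocess, n) = preprocess === nothing ? n >= 5 : preprocess

should_preprocess(nothing, 10)  # true
should_preprocess(false, 10)    # false
```

The two forms are equivalent for `Bool` or `nothing` inputs; the ternary just avoids evaluating `length(xs)` when the caller has already decided.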
@@ -26,13 +26,13 @@ using StringDistances, Unicode, Test
 @test compare("ab", "de", Partial(DamerauLevenshtein())) == 0
 @test normalize(Partial(DamerauLevenshtein()))("ab", "cde") == 1.0
 # Winkler
-@test compare("martha", "marhta", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.9611 atol = 1e-4
-@test compare("dwayne", "duane", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.84 atol = 1e-4
-@test compare("dixon", "dicksonx", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.81333 atol = 1e-4
-@test compare("william", "williams", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.975 atol = 1e-4
-@test compare("", "foo", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.0 atol = 1e-4
-@test compare("a", "a", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 1.0 atol = 1e-4
-@test compare("abc", "xyz", Winkler(Jaro(), p = 0.1, threshold = 0.0, maxlength = 4)) ≈ 0.0 atol = 1e-4
+@test compare("martha", "marhta", JaroWinkler()) ≈ 0.9611 atol = 1e-4
+@test compare("dwayne", "duane", JaroWinkler()) ≈ 0.84 atol = 1e-4
+@test compare("dixon", "dicksonx", JaroWinkler()) ≈ 0.81333 atol = 1e-4
+@test compare("william", "williams", JaroWinkler()) ≈ 0.975 atol = 1e-4
+@test compare("", "foo", JaroWinkler()) ≈ 0.0 atol = 1e-4
+@test compare("a", "a", JaroWinkler()) ≈ 1.0 atol = 1e-4
+@test compare("abc", "xyz", JaroWinkler()) ≈ 0.0 atol = 1e-4
 
 # RatcliffObershelp
 @test compare("New York Mets vs Atlanta Braves", "", RatcliffObershelp()) ≈ 0.0

@@ -104,9 +104,9 @@ using StringDistances, Unicode, Test
-@test findnearest("New York", ["San Francisco", "NewYork", "Newark"], Levenshtein()) == ("NewYork", 2)
+@test findnearest("New York", ["Newark", "San Francisco", "NewYork"], Levenshtein()) == ("NewYork", 3)
 
 @test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein(); min_score = 0.99) == (nothing, nothing)
 
 @test findnearest("New York", ["NewYork", "Newark", "San Francisco"], Jaro()) == ("NewYork", 1)
-@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], QGram(2)) == ("NewYork", 1)
+@test findnearest("New York", ["NewYork", "Newark", "San Francisco"], normalize(QGram(2))) == ("NewYork", 1)
 
 @test findall("New York", ["NewYork", "Newark", "San Francisco"], Levenshtein()) == [1]