add RatcliffObershelp
parent
99b997c9e2
commit
aa4c75a340
86
README.md
|
@ -2,43 +2,95 @@
|
|||
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
|
||||
[![StringDistances](http://pkg.julialang.org/badges/StringDistances_0.4.svg)](http://pkg.julialang.org/?pkg=StringDistances)
|
||||
|
||||
StringDistances computes various distances between strings. The package should work with any `AbstractString` (in particular ASCII and UTF-8 strings).
|
||||
This Julia package computes various distances between strings.
|
||||
|
||||
|
||||
|
||||
## Distances
|
||||
|
||||
#### Edit Distances
|
||||
- Hamming Distance
|
||||
- Jaro Distance
|
||||
- Levenshtein Distance
|
||||
- Damerau-Levenshtein Distance
|
||||
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) (similar to the Python library [difflib](https://docs.python.org/2/library/difflib.html))
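
As a rough illustration of the Ratcliff/Obershelp idea listed above: find the longest common block of two strings, recurse on the pieces to its left and right, and score `2 * matched / (len1 + len2)`. The sketch below is self-contained and written in current Julia syntax. The names `longest_common_block`, `matched_chars` and `ratcliff_obershelp_similarity` are hypothetical and are not the package's functions, and it returns a similarity rather than a distance.

```julia
# Hypothetical helper: longest common block of two character vectors.
# Returns (start in a, start in b, length); the first longest block wins ties.
function longest_common_block(a::Vector{Char}, b::Vector{Char})
    best = (0, 0, 0)
    prev = zeros(Int, length(b) + 1)   # run lengths for the previous row of a
    for i in eachindex(a)
        curr = zeros(Int, length(b) + 1)
        for j in eachindex(b)
            if a[i] == b[j]
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best[3]
                    len = curr[j + 1]
                    best = (i - len + 1, j - len + 1, len)
                end
            end
        end
        prev = curr
    end
    return best
end

# Total number of characters covered by recursively matched blocks.
function matched_chars(a::Vector{Char}, b::Vector{Char})
    (isempty(a) || isempty(b)) && return 0
    i, j, len = longest_common_block(a, b)
    len == 0 && return 0
    return len +
        matched_chars(a[1:i-1], b[1:j-1]) +
        matched_chars(a[i+len:end], b[j+len:end])
end

# Similarity in [0, 1]; the corresponding distance is one minus this value.
function ratcliff_obershelp_similarity(s1, s2)
    total = length(s1) + length(s2)
    total == 0 && return 1.0
    return 2 * matched_chars(collect(s1), collect(s2)) / total
end
```

Working on `Vector{Char}` sidesteps the character-index versus byte-index distinction that string slicing would otherwise require.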
|
||||
|
||||
#### Q-Grams Distances
|
||||
- QGram Distance
|
||||
- Cosine Distance
|
||||
- Jaccard Distance
|
||||
|
||||
|
||||
A good reference about string distances is the article written for the R package `stringdist`:
|
||||
A good reference for q-gram distances is the article written for the R package `stringdist`:
|
||||
*The stringdist Package for Approximate String Matching* Mark P.J. van der Loo
|
||||
|
||||
|
||||
## Syntax
|
||||
- The basic syntax follows the [Distances](https://github.com/JuliaStats/Distances.jl) package:
|
||||
|
||||
|
||||
|
||||
#### evaluate
|
||||
The function `evaluate` returns the literal distance between two strings (a value of 0 means the strings are identical). While some distances are bounded by 1, others such as `Hamming`, `Levenshtein`, `DamerauLevenshtein`, and `QGram` can exceed 1.
|
||||
|
||||
```julia
|
||||
using StringDistances
|
||||
evaluate(Hamming(), "martha", "marhta")
|
||||
#> 2
|
||||
evaluate(QGram(2), "martha", "marhta")
|
||||
#> 6
|
||||
```
|
||||
|
||||
#### compare
|
||||
The higher-level function `compare` computes, for any distance, a similarity score between 0 and 1. A value of 0 means the strings are completely different and a value of 1 means they are identical.
|
||||
```julia
|
||||
using StringDistances
|
||||
compare(Hamming(), "martha", "marhta")
|
||||
#> 0.6666666666666667
|
||||
compare(QGram(2), "martha", "marhta")
|
||||
#> 0.4
|
||||
```
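
For the Hamming case, the similarity score appears to be one minus the distance divided by the length of the longer string. The sketch below reproduces that arithmetic in current Julia syntax. The names `hamming_count` and `hamming_similarity` are hypothetical, and counting the length difference as extra mismatches is an assumption inferred from the package tests rather than taken from the source.

```julia
# Hypothetical helper: mismatched positions plus the length difference
# (assumption: unequal lengths count as extra mismatches).
function hamming_count(s1::AbstractString, s2::AbstractString)
    mismatches = count(((a, b),) -> a != b, zip(s1, s2))
    return mismatches + abs(length(s1) - length(s2))
end

# Similarity in [0, 1]: one minus the distance over the longer length.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    longest = max(length(s1), length(s2))
    longest == 0 && return 1.0   # two empty strings are treated as identical
    return 1.0 - hamming_count(s1, s2) / longest
end

hamming_similarity("martha", "marhta")   # 2 mismatches out of 6 characters
```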
|
||||
|
||||
|
||||
## Modifiers
|
||||
|
||||
The package defines a number of types to modify string metrics:
|
||||
|
||||
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes
|
||||
|
||||
```julia
|
||||
using StringDistances
|
||||
evaluate(Hamming(), "martha", "marhta")
|
||||
evaluate(QGram(2), "martha", "marhta")
|
||||
compare(Jaro(), "martha", "marhta")
|
||||
#> 0.9444444444444445
|
||||
compare(Winkler(Jaro()), "martha", "marhta")
|
||||
#> 0.9611111111111111
|
||||
```
|
||||
|
||||
- Normalize a distance between 0-1 with `Normalized`
|
||||
The Winkler adjustment was originally defined for the Jaro distance but this package defines it for any string distance.
|
||||
|
||||
```julia
|
||||
evaluate(Normalized(Hamming()), "martha", "marhta")
|
||||
evaluate(Normalized(QGram(2)), "martha", "marhta")
|
||||
compare(QGram(2), "william", "williams")
|
||||
#> 0.9230769230769231
|
||||
compare(Winkler(QGram(2)), "william", "williams")
|
||||
#> 0.9538461538461539
|
||||
```
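
The Winkler boost can be sketched as plain arithmetic on a similarity score: when the score is at least the boosting limit, it is increased by `l * scaling_factor * (1 - score)`, where `l` is the common prefix length capped at 4. The function names below are hypothetical and the code is written in current Julia syntax; the defaults 0.1 and 0.7 mirror the values used by `Winkler(x)` in this commit.

```julia
# Hypothetical helper: length of the common prefix, capped at `limit`.
function common_prefix_length(s1, s2, limit=4)
    n = 0
    for (a, b) in zip(s1, s2)
        (a != b || n == limit) && break
        n += 1
    end
    return n
end

# Apply the prefix boost to an already computed similarity score.
function winkler_boost(score, s1, s2; scaling_factor=0.1, boosting_limit=0.7)
    l = common_prefix_length(s1, s2)
    return score >= boosting_limit ? score + l * scaling_factor * (1 - score) : score
end

# Jaro similarity of "martha"/"marhta" is 0.9444...; the shared prefix is "mar".
winkler_boost(0.9444444444444445, "martha", "marhta")
```

Scores below the boosting limit are returned unchanged, so strings that are already dissimilar do not gain from a short shared prefix.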
|
||||
|
||||
- Add a [Winkler adjustment](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) with `Winkler`
|
||||
- For strings composed of several words, the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` distance. This package defines them for any string distance:
|
||||
|
||||
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in string lengths. The function returns the maximal similarity score between the shorter string and all substrings of the longer string.
|
||||
|
||||
```julia
|
||||
compare(Partial(Hamming()), "New York Yankees", "Yankees")
|
||||
#> 1.0
|
||||
```
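
A minimal sketch of the `Partial` idea, assuming any similarity function `sim(a, b)` that returns a value in [0, 1]: slide a window the length of the shorter string over the longer one and keep the best score. The names `partial_similarity` and `exact` are hypothetical, and the code uses current Julia syntax rather than the package's internals.

```julia
# Hypothetical helper: best similarity between the shorter string and
# every same-length window of the longer string.
function partial_similarity(sim, s1, s2)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    chars = collect(long)              # index by character, not by byte
    n = length(short)
    n == 0 && return sim("", "")       # same convention as the empty-string case in the source
    return maximum(sim(short, String(chars[i:i+n-1])) for i in 1:(length(chars) - n + 1))
end

# A deliberately crude similarity used only for illustration.
exact(a, b) = a == b ? 1.0 : 0.0

partial_similarity(exact, "New York Yankees", "Yankees")
```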
|
||||
|
||||
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
|
||||
|
||||
```julia
|
||||
compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
|
||||
#> 1.0
|
||||
```
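
The preprocessing step behind `TokenSort` can be sketched in one line: split on whitespace, sort the words, and join them back with single spaces. The name `token_sort` is hypothetical.

```julia
# Hypothetical helper: normalize word order before comparing.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

token_sort("mariners vs angels") == token_sort("angels vs mariners")
```

Because both inputs normalize to the same string, any underlying distance then reports them as identical.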
|
||||
|
||||
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count.
|
||||
|
||||
```julia
|
||||
compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners")
|
||||
```
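
A sketch of the strings that `TokenSet` compares, assuming a set-based split in the style of fuzzywuzzy: the shared words alone, and the shared words followed by each string's remaining words. The final score would be the maximum similarity over the pairs of these three strings. The name `token_set_strings` is hypothetical, and the `unique` step is a simplification that ignores repeated words.

```julia
# Hypothetical helper: build the three strings compared by a token-set score.
function token_set_strings(a::AbstractString, b::AbstractString)
    t1 = sort(unique(split(a)))
    t2 = sort(unique(split(b)))
    common = intersect(t1, t2)          # words present in both strings
    only1 = setdiff(t1, common)         # words only in `a`
    only2 = setdiff(t2, common)         # words only in `b`
    s0 = join(common, " ")
    s1 = join(vcat(common, only1), " ")
    s2 = join(vcat(common, only2), " ")
    return s0, s1, s2
end

token_set_strings("mariners vs angels",
                  "los angeles angels of anaheim at seattle mariners")
```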
|
||||
|
||||
|
||||
```julia
|
||||
evaluate(Winkler(Jaro()), "martha", "marhta")
|
||||
evaluate(Winkler(QGram(2)), "martha", "marhta")
|
||||
```
|
||||
While the Winkler adjustment was originally defined in the context of the Jaro distance, it can be helpful with other distances too. Note: a distance is automatically normalized between 0 and 1 when used with a Winkler adjustment.
|
||||
|
|
|
@ -19,6 +19,9 @@ end
|
|||
@time f(Float64, Cosine(2), x, y)
|
||||
@time f(Float64, Jaccard(2), x, y)
|
||||
|
||||
#
|
||||
@time f(Float64, RatcliffObershelp(), x, y)
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -9,33 +9,34 @@ module StringDistances
|
|||
##############################################################################
|
||||
|
||||
import Distances: evaluate, Hamming, hamming, PreMetric, SemiMetric
|
||||
export evaluate,
|
||||
Hamming, hamming,
|
||||
Levenshtein, levenshtein,
|
||||
DamerauLevenshtein, damerau_levenshtein,
|
||||
Jaro, jaro,
|
||||
QGram, qgram,
|
||||
Cosine, cosine,
|
||||
Jaccard, jaccard,
|
||||
Normalized,
|
||||
Winkler
|
||||
import Iterators: chain
|
||||
export
|
||||
evaluate,
|
||||
compare,
|
||||
Hamming,
|
||||
Levenshtein,
|
||||
DamerauLevenshtein,
|
||||
Jaro,
|
||||
QGram,
|
||||
Cosine,
|
||||
Jaccard,
|
||||
longest_common_substring,
|
||||
matching_blocks,
|
||||
RatcliffObershelp,
|
||||
Winkler,
|
||||
Partial,
|
||||
TokenSort,
|
||||
TokenSet
|
||||
|
||||
include("distances/evaluate.jl")
|
||||
include("distances/edit.jl")
|
||||
include("distances/qgram.jl")
|
||||
include("distances/RatcliffObershelp.jl")
|
||||
|
||||
# 1. only do the switch once
|
||||
# 2. precomputes length(s1), length(s2)
|
||||
function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString, x...)
|
||||
len1, len2 = length(s1), length(s2)
|
||||
if len1 > len2
|
||||
return evaluate(dist, s2, s1, len2, len1, x...)
|
||||
else
|
||||
return evaluate(dist, s1, s2, len1, len2, x...)
|
||||
end
|
||||
end
|
||||
|
||||
include("edit.jl")
|
||||
include("qgram.jl")
|
||||
include("normalized.jl")
|
||||
include("winkler.jl")
|
||||
include("modifiers/compare.jl")
|
||||
include("modifiers/winkler.jl")
|
||||
include("modifiers/tokenize.jl")
|
||||
include("modifiers/partial.jl")
|
||||
|
||||
|
||||
end
|
|
@ -0,0 +1,57 @@
|
|||
# Return a character index, not a byte index
|
||||
function longest_common_substring(s1::AbstractString, s2::AbstractString)
|
||||
len2 = length(s2)
|
||||
start1, start2, size = 0, 0, 0
|
||||
p = zeros(Int, len2)
|
||||
i1 = 0
|
||||
for ch1 in s1
|
||||
i1 += 1
|
||||
i2 = 0
|
||||
oldp = 0
|
||||
for ch2 in s2
|
||||
i2 += 1
|
||||
newp = 0
|
||||
if ch1 == ch2
|
||||
newp = oldp > 0 ? oldp : i2
|
||||
currentlength = (i2 - newp + 1)
|
||||
if currentlength > size
|
||||
start1, start2, size = i1 - currentlength + 1, newp, currentlength
|
||||
end
|
||||
end
|
||||
p[i2], oldp = newp, p[i2]
|
||||
end
|
||||
end
|
||||
return start1, start2, size
|
||||
end
|
||||
|
||||
function matching_blocks!(x::Set{Tuple{Int, Int, Int}}, s1::AbstractString, s2::AbstractString, start1::Integer, start2::Integer)
|
||||
a = longest_common_substring(s1, s2)
|
||||
if a[3] > 0
|
||||
push!(x, (a[1] + start1 - 1, a[2] + start2 - 1, a[3]))
|
||||
s1before = SubString(s1, start(s1), chr2ind(s1, a[1]) - 1)
|
||||
s2before = SubString(s2, start(s2), chr2ind(s2, a[2]) - 1)
|
||||
matching_blocks!(x, s1before, s2before, start1, start2)
|
||||
if (a[1] + a[3]) <= endof(s1) && (a[2] + a[3]) <= endof(s2)
|
||||
s1after = SubString(s1, chr2ind(s1, a[1] + a[3]), endof(s1))
|
||||
s2after = SubString(s2, chr2ind(s2, a[2] + a[3]), endof(s2))
|
||||
matching_blocks!(x, s1after, s2after, start1 + a[1] + a[3] - 1, start2 + a[2] + a[3] - 1)
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
function matching_blocks(s1::AbstractString, s2::AbstractString)
|
||||
x = Set{Tuple{Int, Int, Int}}()
|
||||
matching_blocks!(x, s1, s2, 1, 1)
|
||||
return x
|
||||
end
|
||||
|
||||
type RatcliffObershelp <: PreMetric end
|
||||
function evaluate(dist::RatcliffObershelp, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len2 == 0 && return 0.0
|
||||
result = matching_blocks(s1, s2)
|
||||
matched = 0
|
||||
for x in result
|
||||
matched += x[3]
|
||||
end
|
||||
1.0 - 2 * matched / (len1 + len2)
|
||||
end
|
|
@ -24,7 +24,7 @@ end
|
|||
##
|
||||
##############################################################################
|
||||
|
||||
function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2:: Integer)
|
||||
count = 0
|
||||
for (ch1, ch2) in zip(s1, s2)
|
||||
count += ch1 != ch2
|
||||
|
@ -33,8 +33,6 @@ function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::I
|
|||
return count
|
||||
end
|
||||
|
||||
hamming(s1::AbstractString, s2::AbstractString) = evaluate(Hamming(), s1, s2)
|
||||
|
||||
##############################################################################
|
||||
##
|
||||
## Levenshtein
|
||||
|
@ -83,9 +81,6 @@ function evaluate(dist::Levenshtein, s1::AbstractString, s2::AbstractString, len
|
|||
end
|
||||
return current
|
||||
end
|
||||
function levenshtein(s1::AbstractString, s2::AbstractString)
|
||||
evaluate(Levenshtein(), s1, s2)
|
||||
end
|
||||
|
||||
##############################################################################
|
||||
##
|
||||
|
@ -157,11 +152,9 @@ function evaluate(dist::DamerauLevenshtein, s1::AbstractString, s2::AbstractStri
|
|||
return current
|
||||
end
|
||||
|
||||
damerau_levenshtein(s1::AbstractString, s2::AbstractString) = evaluate(DamerauLevenshtein(), s1, s2)
|
||||
|
||||
##############################################################################
|
||||
##
|
||||
## JaroWinkler
|
||||
## Jaro
|
||||
##
|
||||
##############################################################################
|
||||
|
||||
|
@ -208,7 +201,3 @@ function evaluate(dist::Jaro, s1::AbstractString, s2::AbstractString, len1::Inte
|
|||
end
|
||||
|
||||
jaro(s1::AbstractString, s2::AbstractString) = evaluate(Jaro(), s1, s2)
|
||||
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString)
|
||||
len1, len2 = length(s1), length(s2)
|
||||
if len1 > len2
|
||||
return evaluate(dist, s2, s1, len2, len1)
|
||||
else
|
||||
return evaluate(dist, s1, s2, len1, len2)
|
||||
end
|
||||
end
|
|
@ -41,7 +41,7 @@ function Base.collect(qgram::QGramIterator)
|
|||
end
|
||||
return x
|
||||
end
|
||||
Base.sort(qgram::QGramIterator) = sort!(collect(qgram), alg = QuickSort)
|
||||
Base.sort(qgram::QGramIterator) = sort!(collect(qgram))
|
||||
|
||||
##############################################################################
|
||||
##
|
||||
|
@ -94,13 +94,12 @@ end
|
|||
##
|
||||
##############################################################################
|
||||
|
||||
type QGram{T <: Integer} <: AbstractQGram
|
||||
immutable QGram{T <: Integer} <: AbstractQGram
|
||||
q::T
|
||||
end
|
||||
QGram() = QGram(2)
|
||||
|
||||
function evaluate(dist::QGram, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len2 == 0 && return 0
|
||||
n = 0
|
||||
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
|
||||
n += abs(n1 - n2)
|
||||
|
@ -119,14 +118,13 @@ end
|
|||
## 1 - v(s1, p).v(s2, p) / ||v(s1, p)|| * ||v(s2, p)||
|
||||
##############################################################################
|
||||
|
||||
type Cosine{T <: Integer} <: AbstractQGram
|
||||
immutable Cosine{T <: Integer} <: AbstractQGram
|
||||
q::T
|
||||
end
|
||||
Cosine() = Cosine(2)
|
||||
|
||||
function evaluate(dist::Cosine, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len2 == 0 && return 0.0
|
||||
(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
|
||||
len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
|
||||
norm1, norm2, prodnorm = 0, 0, 0
|
||||
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
|
||||
norm1 += n1^2
|
||||
|
@ -147,18 +145,15 @@ end
|
|||
## Denote Q(s, q) the set of tuple of length q in s
|
||||
## 1 - |intersect(Q(s1, q), Q(s2, q))| / |union(Q(s1, q), Q(s2, q))|
|
||||
##
|
||||
## return 1.0 if smaller than qgram
|
||||
##
|
||||
##############################################################################
|
||||
|
||||
type Jaccard{T <: Integer} <: AbstractQGram
|
||||
immutable Jaccard{T <: Integer} <: AbstractQGram
|
||||
q::T
|
||||
end
|
||||
Jaccard() = Jaccard(2)
|
||||
|
||||
function evaluate(dist::Jaccard, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len2 == 0 && return 0.0
|
||||
(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
|
||||
len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
|
||||
ndistinct1, ndistinct2, nintersect = 0, 0, 0
|
||||
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
|
||||
ndistinct1 += n1 > 0
|
|
@ -0,0 +1,36 @@
|
|||
##############################################################################
|
||||
##
|
||||
## compare
|
||||
##
|
||||
##############################################################################
|
||||
|
||||
function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString)
|
||||
len1, len2 = length(s1), length(s2)
|
||||
if len1 > len2
|
||||
return compare(dist, s2, s1, len2, len1)
|
||||
else
|
||||
return compare(dist, s1, s2, len1, len2)
|
||||
end
|
||||
end
|
||||
|
||||
|
||||
|
||||
function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
1.0 - evaluate(dist, s1, s2, len1, len2)
|
||||
end
|
||||
|
||||
function compare(dist::Union{Hamming, Levenshtein, DamerauLevenshtein}, s1::AbstractString, s2::AbstractString,
|
||||
len1::Integer, len2::Integer)
|
||||
distance = evaluate(dist, s1, s2, len1, len2)
|
||||
return len2 == 0 ? 1.0 : 1.0 - distance / len2
|
||||
end
|
||||
|
||||
function compare(dist::QGram, s1::AbstractString, s2::AbstractString,
|
||||
len1::Integer, len2::Integer)
|
||||
distance = evaluate(dist, s1, s2, len1, len2)
|
||||
if len1 <= (dist.q - 1)
|
||||
return s1 == s2 ? 1.0 : 0.0
|
||||
else
|
||||
return 1 - distance / (len1 + len2 - 2 * dist.q + 2)
|
||||
end
|
||||
end
|
|
@ -0,0 +1,43 @@
|
|||
##############################################################################
|
||||
##
|
||||
## Partial
|
||||
## From the Python module fuzzywuzzy
|
||||
## http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
|
||||
##
|
||||
##############################################################################
|
||||
type Partial{T <: PreMetric} <: PreMetric
|
||||
dist::T
|
||||
end
|
||||
|
||||
# general
|
||||
function compare(dist::Partial, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
|
||||
len1 == 0 && return compare(dist.dist, "", "", 0, 0)
|
||||
iter = QGramIterator(s2, len2, len1)
|
||||
state = start(iter)
|
||||
s, state = next(iter, state)
|
||||
out = compare(dist.dist, s1, s)
|
||||
while !done(iter, state)
|
||||
s, state = next(iter, state)
|
||||
curr = compare(dist.dist, s1, s)
|
||||
out = max(out, curr)
|
||||
end
|
||||
return out
|
||||
end
|
||||
|
||||
# Specialization for RatcliffObershelp distance
|
||||
# Code: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py
|
||||
function compare(dist::Partial{RatcliffObershelp}, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
|
||||
out = 0.0
|
||||
result = matching_blocks(s1, s2)
|
||||
for r in result
|
||||
s2_start = max(1, r[2] - r[1] + 1)
|
||||
s2_end = s2_start + len1 - 1
|
||||
i2_start = chr2ind(s2, s2_start)
|
||||
i2_end = s2_end == len2 ? endof(s2) : (chr2ind(s2, s2_end + 1) - 1)
|
||||
curr = compare(RatcliffObershelp(), s1, SubString(s2, i2_start, i2_end), len1, len1)
|
||||
out = max(out, curr)
|
||||
end
|
||||
return out
|
||||
end
|
|
@ -0,0 +1,61 @@
|
|||
##############################################################################
|
||||
##
|
||||
## TokenSort
|
||||
##
|
||||
##############################################################################
|
||||
type TokenSort{T <: PreMetric} <: PreMetric
|
||||
dist::T
|
||||
end
|
||||
|
||||
function compare{T <: AbstractString}(dist::TokenSort, s1::T, s2::T, len1::Integer, len2::Integer)
|
||||
s1 = join(sort!(split(s1)), " ")
|
||||
s2 = join(sort!(split(s2)), " ")
|
||||
compare(dist.dist, s1, s2)
|
||||
end
|
||||
|
||||
##############################################################################
|
||||
##
|
||||
## TokenSet
|
||||
##
|
||||
##############################################################################
|
||||
type TokenSet{T <: PreMetric} <: PreMetric
|
||||
dist::T
|
||||
end
|
||||
|
||||
function compare{T <: AbstractString}(dist::TokenSet, s1::T, s2::T, len1::Integer, len2::Integer)
|
||||
v0, v1, v2 = _separate!(split(s1), split(s2))
|
||||
s0 = join(v0, " ")
|
||||
s1 = join(chain(v0, v1), " ")
|
||||
s2 = join(chain(v0, v2), " ")
|
||||
if isempty(s0)
|
||||
# otherwise compare(dist, "", "a")== 1.0
|
||||
compare(dist.dist, s1, s2)
|
||||
else
|
||||
max(compare(dist.dist, s0, s1),
|
||||
compare(dist.dist, s1, s2),
|
||||
compare(dist.dist, s0, s2))
|
||||
end
|
||||
end
|
||||
|
||||
# Split two vectors into their intersection and the two set differences (all sorted)
|
||||
function _separate!(v1::Vector, v2::Vector)
|
||||
sort!(v1)
|
||||
sort!(v2)
|
||||
out = eltype(v1)[]
|
||||
start = 1
|
||||
i1 = 0
|
||||
while i1 < length(v1)
|
||||
i1 += 1
|
||||
x = v1[i1]
|
||||
i2 = searchsortedfirst(v2, x, start, length(v2), Base.Forward)
|
||||
i2 > length(v2) && break
|
||||
if i2 > 0 && v2[i2] == x
|
||||
deleteat!(v1, i1)
|
||||
deleteat!(v2, i2)
|
||||
push!(out, x)
|
||||
i1 -= 1
|
||||
start = i2
|
||||
end
|
||||
end
|
||||
return out, v1, v2
|
||||
end
|
|
@ -7,18 +7,18 @@
|
|||
type Winkler{T1 <: PreMetric, T2 <: Real, T3 <: Real} <: PreMetric
|
||||
dist::T1
|
||||
scaling_factor::T2 # scaling factor. Default to 0.1
|
||||
boosting_limit::T3 # boost threshold. Default to 1.0
|
||||
boosting_limit::T3 # boost threshold. Default to 0.7
|
||||
end
|
||||
|
||||
# restrict to distance between 0 and 1
|
||||
Winkler(x) = Winkler(x, 0.1, 1.0)
|
||||
Winkler(x) = Winkler(x, 0.1, 0.7)
|
||||
|
||||
function evaluate(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
distance = evaluate(Normalized(dist.dist), s1, s2, len1, len2)
|
||||
function compare(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
score = compare(dist.dist, s1, s2, len1, len2)
|
||||
l = common_prefix(s1, s2, 4)[1]
|
||||
# common prefix adjustment
|
||||
if distance <= dist.boosting_limit
|
||||
distance -= distance * l * dist.scaling_factor
|
||||
if score >= dist.boosting_limit
|
||||
score += l * dist.scaling_factor * (1 - score)
|
||||
end
|
||||
return distance
|
||||
return score
|
||||
end
|
|
@ -1,30 +0,0 @@
|
|||
##############################################################################
|
||||
##
|
||||
## Normalized
|
||||
##
|
||||
##############################################################################
|
||||
|
||||
type Normalized{T <: PreMetric} <: PreMetric
|
||||
dist::T
|
||||
end
|
||||
|
||||
function evaluate(normalized::Normalized, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
|
||||
evaluate(normalized.dist, s1, s2, len1, len2)
|
||||
end
|
||||
|
||||
function evaluate{T <: Union{Hamming, Levenshtein, DamerauLevenshtein}}(
|
||||
normalized::Normalized{T}, s1::AbstractString, s2::AbstractString,
|
||||
len1::Integer, len2::Integer)
|
||||
distance = evaluate(normalized.dist, s1, s2, len1, len2)
|
||||
return distance / len2
|
||||
end
|
||||
|
||||
function evaluate{T <: QGram}(normalized::Normalized{T}, s1::AbstractString, s2::AbstractString,
|
||||
len1::Integer, len2::Integer)
|
||||
distance = evaluate(normalized.dist, s1, s2, len1, len2)
|
||||
if len1 <= (normalized.dist.q - 1)
|
||||
return s1 == s2 ? 0.0 : 1.0
|
||||
else
|
||||
return distance / (len1 + len2 - 2 * normalized.dist.q + 2)
|
||||
end
|
||||
end
|
|
@ -2,15 +2,6 @@
|
|||
using StringDistances, Base.Test
|
||||
|
||||
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "martha", "marhta") 1 - 0.9611 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dwayne", "duane") 1 - 0.84 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dixon", "dicksonx") 1 - 0.81333 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "william", "williams") 1 - 0.975 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "", "foo") 1.0 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "a", "a") 0.0 1e-4
|
||||
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "abc", "xyz") 1.0 1e-4
|
||||
|
||||
|
||||
@test evaluate(Levenshtein(), "", "") == 0
|
||||
@test evaluate(Levenshtein(), "abc", "") == 3
|
||||
@test evaluate(Levenshtein(), "", "abc") == 3
|
||||
|
@ -43,24 +34,10 @@ using StringDistances, Base.Test
|
|||
@test evaluate(Hamming(), "saturday", "sunday") == 7
|
||||
|
||||
|
||||
|
||||
@test_approx_eq_eps evaluate(Normalized(Hamming()), "", "abc") 1.0 1e-4
|
||||
@test_approx_eq_eps evaluate(Normalized(Hamming()), "acc", "abc") 1/3 1e-4
|
||||
@test_approx_eq_eps evaluate(Normalized(Hamming()), "saturday", "sunday") 7/8 1e-4
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@test evaluate(QGram(1), "", "abc") == 3
|
||||
@test evaluate(QGram(1), "abc", "cba") == 0
|
||||
@test evaluate(QGram(1), "abc", "ccc") == 4
|
||||
|
||||
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "", "abc") 1.0 1e-4
|
||||
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "cba") 0.0 1e-4
|
||||
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "ccc") 2/3 1e-4
|
||||
|
||||
@test_approx_eq_eps evaluate(Cosine(2), "", "abc") 1 1e-4
|
||||
@test_approx_eq_eps evaluate(Cosine(2), "abc", "ccc") 1 1e-4
|
||||
@test_approx_eq_eps evaluate(Cosine(2), "leia", "leela") 0.7113249 1e-4
|
||||
|
@ -71,17 +48,6 @@ using StringDistances, Base.Test
|
|||
@test_approx_eq_eps evaluate(Jaccard(2), "leia", "leela") 0.83333 1e-4
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
strings = [
|
||||
("martha", "marhta"),
|
||||
("dwayne", "duane") ,
|
||||
|
@ -106,7 +72,6 @@ strings = [
|
|||
for x in ((Levenshtein(), [2 2 4 1 3 0 3 2 3 3 4 6 17 3 3 2]),
|
||||
(DamerauLevenshtein(), [1 2 4 1 3 0 3 2 3 3 4 6 17 2 2 2]),
|
||||
(Jaro(), [0.05555556 0.17777778 0.23333333 0.04166667 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.24722222 0.16190476 0.48809524 0.49166667 0.07407407 0.16666667 0.21666667]),
|
||||
(Winkler(Jaro(), 0.1, 1.0), [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]),
|
||||
(QGram(1), [0 3 3 1 3 0 6 4 5 4 4 11 14 0 0 3]),
|
||||
(QGram(2), [ 6 7 7 1 2 0 4 4 7 8 4 13 32 8 6 5]),
|
||||
(Jaccard(1), [0.0000000 0.4285714 0.3750000 0.1666667 1.0 0.0000000 1.0000000 0.6666667 0.5714286 0.3750000 0.2000000 0.8333333 0.5000000 0.0000000 0.0000000 0.2500000]),
|
||||
|
@ -145,3 +110,28 @@ stringdist(strings[1,], strings[2,], method = "jw", p = 0.1)
|
|||
stringdist(strings[1,], strings[2,], method = "qgram", q = 1)
|
||||
|
||||
=#
|
||||
|
||||
|
||||
|
||||
Set([(1,1,3)
|
||||
(4,5,1)
|
||||
(6,6,1)
|
||||
])
|
||||
@test matching_blocks("dwayne", "duane") ==
|
||||
Set([(5,4,2)
|
||||
(1,1,1)
|
||||
(3,3,1)])
|
||||
@test matching_blocks("dixon", "dicksonx") ==
|
||||
Set([(1,1,2)
|
||||
(4,6,2)
|
||||
])
|
||||
|
||||
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "dixon", "dicksonx") 1 - 0.6153846153846154
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "alexandre", "aleksander") 1 - 0.7368421052631579
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "pennsylvania", "pencilvaneya") 1 - 0.6666666666666
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "", "pencilvaneya") 1.0
|
||||
@test_approx_eq evaluate(RatcliffObershelp(),"NEW YORK METS", "NEW YORK MEATS") 1 - 0.962962962963
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "Yankees", "New York Yankees") 0.3913043478260869
|
||||
@test_approx_eq evaluate(RatcliffObershelp(), "New York Mets", "New York Yankees") 0.24137931034482762
|
||||
|
||||
|
|
|
@ -0,0 +1,62 @@
|
|||
|
||||
using StringDistances, Base.Test
|
||||
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "martha", "marhta") 0.9611 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dwayne", "duane") 0.84 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dixon", "dicksonx") 0.81333 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "william", "williams") 0.975 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "", "foo") 0.0 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "a", "a") 1.0 1e-4
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "abc", "xyz") 0.0 1e-4
|
||||
|
||||
strings = [
|
||||
("martha", "marhta"),
|
||||
("dwayne", "duane") ,
|
||||
("dixon", "dicksonx"),
|
||||
("william", "williams"),
|
||||
("", "foo"),
|
||||
("a", "a"),
|
||||
("abc", "xyz"),
|
||||
("abc", "ccc"),
|
||||
("kitten", "sitting"),
|
||||
("saturday", "sunday"),
|
||||
("hi, my name is", "my name is"),
|
||||
("alborgów", "amoniak"),
|
||||
("cape sand recycling ", "edith ann graham"),
|
||||
( "jellyifhs", "jellyfish"),
|
||||
("ifhs", "fish"),
|
||||
("leia", "leela"),
|
||||
]
|
||||
solutions = [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]
|
||||
for i in 1:length(solutions)
|
||||
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), strings[i]...) (1 - solutions[i]) 1e-4
|
||||
end
|
||||
|
||||
|
||||
|
||||
|
||||
@test_approx_eq_eps compare(Hamming(), "", "abc") 0.0 1e-4
|
||||
@test_approx_eq_eps compare(Hamming(), "acc", "abc") 2/3 1e-4
|
||||
@test_approx_eq_eps compare(Hamming(), "saturday", "sunday") 1/8 1e-4
|
||||
|
||||
@test_approx_eq_eps compare(QGram(1), "", "abc") 0.0 1e-4
|
||||
@test_approx_eq_eps compare(QGram(1), "abc", "cba") 1.0 1e-4
|
||||
@test_approx_eq_eps compare(QGram(1), "abc", "ccc") 1/3 1e-4
|
||||
|
||||
|
||||
@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees", "Yankees") 1.0
|
||||
@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees", "") 0.0
|
||||
|
||||
|
||||
@test_approx_eq compare(Partial(Hamming()), "New York Yankees", "Yankees") 1
|
||||
@test_approx_eq compare(Partial(Hamming()), "New York Yankees", "") 1
|
||||
|
||||
|
||||
|
||||
|
||||
@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") 1.0
|
||||
@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners") 1.0 - 0.09090909090909094
|
||||
|
||||
|
||||
@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "") 0.0
|
||||
@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "") 0.0
|
|
@ -1,7 +1,6 @@
|
|||
using StringDistances
|
||||
|
||||
tests = ["distances.jl"
|
||||
]
|
||||
tests = ["distances.jl", "modifiers.jl"]
|
||||
|
||||
println("Running tests:")
|
||||
|
||||
|
|