add RatcliffObershelp

pull/3/head
matthieugomez 2015-11-04 12:40:30 -05:00
parent 99b997c9e2
commit aa4c75a340
16 changed files with 408 additions and 141 deletions


@ -2,43 +2,95 @@
[![Coverage Status](https://coveralls.io/repos/matthieugomez/StringDistances.jl/badge.svg?branch=master)](https://coveralls.io/r/matthieugomez/StringDistances.jl?branch=master)
[![StringDistances](http://pkg.julialang.org/badges/StringDistances_0.4.svg)](http://pkg.julialang.org/?pkg=StringDistances)
StringDistances computes various distances between strings. The package should work with any `AbstractString` (in particular ASCII and UTF-8).
This Julia package computes various distances between strings.
## Distances
#### Edit Distances
- Hamming Distance
- Jaro Distance
- Levenshtein Distance
- Damerau-Levenshtein Distance
- [RatcliffObershelp Distance](https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html) (similar to the Python library [difflib](https://docs.python.org/2/library/difflib.html))
#### Q-Grams Distances
- QGram Distance
- Cosine Distance
- Jaccard Distance
A good reference about string distances is the article written for the R package `stringdist`:
A good reference for q-gram distances is the article written for the R package `stringdist`:
*The stringdist Package for Approximate String Matching* Mark P.J. van der Loo
## Syntax
- The basic syntax follows the [Distances](https://github.com/JuliaStats/Distances.jl) package:
#### evaluate
The function `evaluate` returns the literal distance between two strings (a value of 0 means the strings are identical). While some distances are bounded by 1, others, such as `Hamming`, `Levenshtein`, `DamerauLevenshtein`, and `QGram`, can exceed 1.
```julia
using StringDistances
evaluate(Hamming(), "martha", "marhta")
#> 2
evaluate(QGram(2), "martha", "marhta")
#> 6
```
#### compare
The higher-level function `compare` computes, for any distance, a similarity score between 0 and 1: a value of 0 means the strings are completely different, and a value of 1 means they are identical.
```julia
using StringDistances
compare(Hamming(), "martha", "marhta")
#> 0.6666666666666667
compare(QGram(2), "martha", "marhta")
#> 0.4
```
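As a rough illustration of how `compare` rescales a raw distance into a similarity score, the sketch below reproduces the Hamming numbers shown here. It is a standalone approximation written for current Julia (1.x), not the package's code; `hamming_count` and `hamming_similarity` are hypothetical names.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Number of mismatched positions, plus the difference in length.
function hamming_count(s1::AbstractString, s2::AbstractString)
    mismatches = count(((c1, c2),) -> c1 != c2, zip(s1, s2))
    return mismatches + abs(length(s1) - length(s2))
end

# Divide by the length of the longer string and flip into a similarity.
function hamming_similarity(s1::AbstractString, s2::AbstractString)
    longest = max(length(s1), length(s2))
    longest == 0 && return 1.0
    return 1.0 - hamming_count(s1, s2) / longest
end

hamming_count("martha", "marhta")       # 2 mismatches
hamming_similarity("martha", "marhta")  # 1 - 2/6
```

For `QGram(2)` the same idea applies with a different denominator: the distance 6 is divided by `len1 + len2 - 2q + 2 = 10`, which gives the 0.4 above.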
## Modifiers
The package defines a number of types to modify string metrics:
- [Winkler](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) boosts the similarity score of strings with common prefixes
```julia
using StringDistances
evaluate(Hamming(), "martha", "marhta")
evaluate(QGram(2), "martha", "marhta")
compare(Jaro(), "martha", "marhta")
#> 0.9444444444444445
compare(Winkler(Jaro()), "martha", "marhta")
#> 0.9611111111111111
```
- Normalize a distance between 0-1 with `Normalized`
The Winkler adjustment was originally defined for the Jaro distance but this package defines it for any string distance.
```julia
evaluate(Normalized(Hamming()), "martha", "marhta")
evaluate(Normalized(QGram(2)), "martha", "marhta")
compare(QGram(2), "william", "williams")
#> 0.9230769230769231
compare(Winkler(QGram(2)), "william", "williams")
#> 0.9538461538461539
```
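The boosted scores above are consistent with the update `score + l * scaling_factor * (1 - score)`, where `l` is the length of the common prefix, capped at 4 characters. A minimal sketch, assuming current Julia (1.x); `common_prefix_length` and `winkler_boost` are hypothetical helpers, and the defaults 0.1 and 0.7 mirror the `Winkler(x)` constructor in this commit.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Length of the common prefix of s1 and s2, capped at `maxlen` characters.
function common_prefix_length(s1::AbstractString, s2::AbstractString, maxlen::Integer = 4)
    n = 0
    for (c1, c2) in zip(s1, s2)
        (c1 != c2 || n == maxlen) && break
        n += 1
    end
    return n
end

# Apply the prefix boost only when the score already reaches the threshold.
function winkler_boost(score::Real, s1::AbstractString, s2::AbstractString;
                       scaling_factor = 0.1, boosting_limit = 0.7)
    score < boosting_limit && return float(score)
    l = common_prefix_length(s1, s2)
    return score + l * scaling_factor * (1 - score)
end

# Jaro score for ("martha", "marhta") taken from the example above.
winkler_boost(0.9444444444444445, "martha", "marhta")
```

With a shared prefix of 3 characters (`"mar"`), the Jaro score 0.9444 moves 30% of the way towards 1, which matches the 0.9611 reported above.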
- Add a [Winkler adjustment](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) with `Winkler`
- For strings composed of several words, the Python library [fuzzywuzzy](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) defines a few modifiers for the `RatcliffObershelp` distance. This package defines them for any string distance:
- [Partial](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in string lengths. The function returns the maximal similarity score between the shorter string and all substrings of the longer string that have the same length as the shorter string.
```julia
compare(Partial(Hamming()), "New York Yankees", "Yankees")
#> 1.0
```
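The windowing idea behind `Partial` can be sketched as follows: slide a window the size of the shorter string over the longer one and keep the best score. This is an illustrative version for current Julia (1.x), not the package's implementation; `partial_similarity` and `equal_length_similarity` are hypothetical names.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Share of matching positions between two strings of equal length.
function equal_length_similarity(a::AbstractString, b::AbstractString)
    isempty(a) && return 1.0
    return 1.0 - count(((x, y),) -> x != y, zip(a, b)) / length(a)
end

# Best score of the shorter string against every same-length window
# of the longer string.
function partial_similarity(similarity, s1::AbstractString, s2::AbstractString)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    chars = collect(long)          # work on characters, not bytes
    n = length(short)
    n == length(chars) && return similarity(short, long)
    best = 0.0
    for i in 1:(length(chars) - n + 1)
        window = String(chars[i:i + n - 1])
        best = max(best, similarity(short, window))
    end
    return best
end

partial_similarity(equal_length_similarity, "New York Yankees", "Yankees")
```

Because `"Yankees"` appears verbatim as one of the windows, the best score is 1.0, as in the example above.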
- [TokenSort](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order by reordering words alphabetically.
```julia
compare(TokenSort(RatcliffObershelp()),"mariners vs angels", "angels vs mariners")
#> 1.0
```
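The reason this pair scores 1.0 is that both strings become identical once their words are sorted. A small sketch for current Julia (1.x); `sort_tokens` is a hypothetical helper name.

```julia
# Hypothetical helper for illustration; not part of StringDistances.

# Split on whitespace, sort the words, and join them back with single spaces.
sort_tokens(s::AbstractString) = join(sort(split(s)), " ")

sort_tokens("mariners vs angels")   # "angels mariners vs"
sort_tokens("angels vs mariners")   # "angels mariners vs"
```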
- [TokenSet](http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/) adjusts for differences in word order and word count.
```julia
compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners")
```
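`TokenSet` compares three strings built from the shared words and the words unique to each input, and keeps the best pairwise score. The sketch below only shows how those three strings could be built. It assumes current Julia (1.x) and uses `Set`, which drops repeated words, whereas the package works on sorted vectors; `token_set_strings` is a hypothetical name.

```julia
# Hypothetical helper for illustration; not part of StringDistances.

# Return (shared words, shared words + words only in s1, shared words + words only in s2),
# each joined with single spaces and sorted alphabetically within its group.
function token_set_strings(s1::AbstractString, s2::AbstractString)
    t1, t2 = Set(split(s1)), Set(split(s2))
    common = sort!(collect(intersect(t1, t2)))
    only1 = sort!(collect(setdiff(t1, t2)))
    only2 = sort!(collect(setdiff(t2, t1)))
    return (join(common, " "),
            join(vcat(common, only1), " "),
            join(vcat(common, only2), " "))
end

token_set_strings("mariners vs angels",
                  "los angeles angels of anaheim at seattle mariners")
```

Here the shared part is `"angels mariners"`, so the first input becomes `"angels mariners vs"`, which is nearly identical to the shared string on its own.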
```julia
evaluate(Winkler(Jaro()), "martha", "marhta")
evaluate(Winkler(QGram(2)), "martha", "marhta")
```
While the Winkler adjustment was originally defined in the context of the Jaro distance, it can be helpful with other distances too. Note: a distance is automatically normalized between 0 and 1 when used with a Winkler adjustment.


@ -1,2 +1,3 @@
julia 0.4
Distances
Distances
Iterators


@ -19,6 +19,9 @@ end
@time f(Float64, Cosine(2), x, y)
@time f(Float64, Jaccard(2), x, y)
#
@time f(Float64, RatcliffObershelp(), x, y)


@ -9,33 +9,34 @@ module StringDistances
##############################################################################
import Distances: evaluate, Hamming, hamming, PreMetric, SemiMetric
export evaluate,
Hamming, hamming,
Levenshtein, levenshtein,
DamerauLevenshtein, damerau_levenshtein,
Jaro, jaro,
QGram, qgram,
Cosine, cosine,
Jaccard, jaccard,
Normalized,
Winkler
import Iterators: chain
export
evaluate,
compare,
Hamming,
Levenshtein,
DamerauLevenshtein,
Jaro,
QGram,
Cosine,
Jaccard,
longest_common_substring,
matching_blocks,
RatcliffObershelp,
Winkler,
Partial,
TokenSort,
TokenSet
include("distances/evaluate.jl")
include("distances/edit.jl")
include("distances/qgram.jl")
include("distances/RatcliffObershelp.jl")
# 1. only do the switch once
# 2. precomputes length(s1), length(s2)
function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString, x...)
len1, len2 = length(s1), length(s2)
if len1 > len2
return evaluate(dist, s2, s1, len2, len1, x...)
else
return evaluate(dist, s1, s2, len1, len2, x...)
end
end
include("edit.jl")
include("qgram.jl")
include("normalized.jl")
include("winkler.jl")
include("modifiers/compare.jl")
include("modifiers/winkler.jl")
include("modifiers/tokenize.jl")
include("modifiers/partial.jl")
end


@ -0,0 +1,57 @@
# Return a character index, not a byte index
function longest_common_substring(s1::AbstractString, s2::AbstractString)
len2 = length(s2)
start1, start2, size = 0, 0, 0
p = zeros(Int, len2)
i1 = 0
for ch1 in s1
i1 += 1
i2 = 0
oldp = 0
for ch2 in s2
i2 += 1
newp = 0
if ch1 == ch2
newp = oldp > 0 ? oldp : i2
currentlength = (i2 - newp + 1)
if currentlength > size
start1, start2, size = i1 - currentlength + 1, newp, currentlength
end
end
p[i2], oldp = newp, p[i2]
end
end
return start1, start2, size
end
function matching_blocks!(x::Set{Tuple{Int, Int, Int}}, s1::AbstractString, s2::AbstractString, start1::Integer, start2::Integer)
a = longest_common_substring(s1, s2)
if a[3] > 0
push!(x, (a[1] + start1 - 1, a[2] + start2 - 1, a[3]))
s1before = SubString(s1, start(s1), chr2ind(s1, a[1]) - 1)
s2before = SubString(s2, start(s2), chr2ind(s2, a[2]) - 1)
matching_blocks!(x, s1before, s2before, start1, start2)
if (a[1] + a[3]) <= endof(s1) && (a[2] + a[3]) <= endof(s2)
s1after = SubString(s1, chr2ind(s1, a[1] + a[3]), endof(s1))
s2after = SubString(s2, chr2ind(s2, a[2] + a[3]), endof(s2))
matching_blocks!(x, s1after, s2after, start1 + a[1] + a[3] - 1, start2 + a[2] + a[3] - 1)
end
end
end
function matching_blocks(s1::AbstractString, s2::AbstractString)
x = Set{Tuple{Int, Int, Int}}()
matching_blocks!(x, s1, s2, 1, 1)
return x
end
type RatcliffObershelp <: PreMetric end
function evaluate(dist::RatcliffObershelp, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len2 == 0 && return 0.0
result = matching_blocks(s1, s2)
matched = 0
for x in result
matched += x[3]
end
1.0 - 2 * matched / (len1 + len2)
end
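The recursion in `matching_blocks!` above (find the longest common substring, then recurse on what lies to its left and to its right) can be sketched in a self-contained form. This version targets current Julia (1.x) and works on `Vector{Char}` to sidestep byte indices. It is an illustration under those assumptions, not the package's code, and its tie-breaking between equally long blocks may differ. All function names are hypothetical.

```julia
# Hypothetical helpers for illustration; not part of StringDistances.

# Longest common substring of a and b as (start_in_a, start_in_b, length),
# with 1-based character positions; length 0 means nothing in common.
function longest_common_block(a::Vector{Char}, b::Vector{Char})
    best = (0, 0, 0)
    prev = zeros(Int, length(b) + 1)   # prev[j + 1]: common run ending at a[i - 1], b[j]
    for i in eachindex(a)
        curr = zeros(Int, length(b) + 1)
        for j in eachindex(b)
            if a[i] == b[j]
                curr[j + 1] = prev[j] + 1
                if curr[j + 1] > best[3]
                    len = curr[j + 1]
                    best = (i - len + 1, j - len + 1, len)
                end
            end
        end
        prev = curr
    end
    return best
end

# Total number of characters covered by the recursively matched blocks.
function matched_characters(a::Vector{Char}, b::Vector{Char})
    (isempty(a) || isempty(b)) && return 0
    i, j, len = longest_common_block(a, b)
    len == 0 && return 0
    left = matched_characters(a[1:i - 1], b[1:j - 1])
    right = matched_characters(a[i + len:end], b[j + len:end])
    return len + left + right
end

# Similarity 2 * matched / (len1 + len2); the distance is one minus this.
function ratcliff_obershelp_similarity(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    total = length(a) + length(b)
    total == 0 && return 1.0
    return 2 * matched_characters(a, b) / total
end

ratcliff_obershelp_similarity("dixon", "dicksonx")   # 2 * 4 / 13
```

For `"dixon"` and `"dicksonx"` the matched blocks are `"di"` and `"on"`, so 4 of the 13 characters pair up on each side.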


@ -24,7 +24,7 @@ end
##
##############################################################################
function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::Integer, len2:: Integer)
count = 0
for (ch1, ch2) in zip(s1, s2)
count += ch1 != ch2
@ -33,8 +33,6 @@ function evaluate(dist::Hamming, s1::AbstractString, s2::AbstractString, len1::I
return count
end
hamming(s1::AbstractString, s2::AbstractString) = evaluate(Hamming(), s1, s2)
##############################################################################
##
## Levenshtein
@ -83,9 +81,6 @@ function evaluate(dist::Levenshtein, s1::AbstractString, s2::AbstractString, len
end
return current
end
function levenshtein(s1::AbstractString, s2::AbstractString)
evaluate(Levenshtein(), s1, s2)
end
##############################################################################
##
@ -157,11 +152,9 @@ function evaluate(dist::DamerauLevenshtein, s1::AbstractString, s2::AbstractStri
return current
end
damerau_levenshtein(s1::AbstractString, s2::AbstractString) = evaluate(DamerauLevenshtein(), s1, s2)
##############################################################################
##
## JaroWinkler
## Jaro
##
##############################################################################
@ -208,7 +201,3 @@ function evaluate(dist::Jaro, s1::AbstractString, s2::AbstractString, len1::Inte
end
jaro(s1::AbstractString, s2::AbstractString) = evaluate(Jaro(), s1, s2)


@ -0,0 +1,8 @@
function evaluate(dist::PreMetric, s1::AbstractString, s2::AbstractString)
len1, len2 = length(s1), length(s2)
if len1 > len2
return evaluate(dist, s2, s1, len2, len1)
else
return evaluate(dist, s1, s2, len1, len2)
end
end


@ -41,7 +41,7 @@ function Base.collect(qgram::QGramIterator)
end
return x
end
Base.sort(qgram::QGramIterator) = sort!(collect(qgram), alg = QuickSort)
Base.sort(qgram::QGramIterator) = sort!(collect(qgram))
##############################################################################
##
@ -94,13 +94,12 @@ end
##
##############################################################################
type QGram{T <: Integer} <: AbstractQGram
immutable QGram{T <: Integer} <: AbstractQGram
q::T
end
QGram() = QGram(2)
function evaluate(dist::QGram, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len2 == 0 && return 0
n = 0
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
n += abs(n1 - n2)
@ -119,14 +118,13 @@ end
## 1 - v(s1, p).v(s2, p) / (||v(s1, p)|| * ||v(s2, p)||)
##############################################################################
type Cosine{T <: Integer} <: AbstractQGram
immutable Cosine{T <: Integer} <: AbstractQGram
q::T
end
Cosine() = Cosine(2)
function evaluate(dist::Cosine, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len2 == 0 && return 0.0
(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
norm1, norm2, prodnorm = 0, 0, 0
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
norm1 += n1^2
@ -147,18 +145,15 @@ end
## Denote Q(s, q) the set of tuple of length q in s
## 1 - |intersect(Q(s1, q), Q(s2, q))| / |union(Q(s1, q), Q(s2, q))|
##
## return 1.0 if smaller than qgram
##
##############################################################################
type Jaccard{T <: Integer} <: AbstractQGram
immutable Jaccard{T <: Integer} <: AbstractQGram
q::T
end
Jaccard() = Jaccard(2)
function evaluate(dist::Jaccard, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len2 == 0 && return 0.0
(len1 <= (dist.q - 1)) && return convert(Float64, s1 != s2)
len1 <= (dist.q - 1) && return convert(Float64, s1 != s2)
ndistinct1, ndistinct2, nintersect = 0, 0, 0
for (n1, n2) in PairIterator(s1, s2, len1, len2, dist.q)
ndistinct1 += n1 > 0

36
src/modifiers/compare.jl Normal file

@ -0,0 +1,36 @@
##############################################################################
##
## compare
##
##############################################################################
function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString)
len1, len2 = length(s1), length(s2)
if len1 > len2
return compare(dist, s2, s1, len2, len1)
else
return compare(dist, s1, s2, len1, len2)
end
end
function compare(dist::PreMetric, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
1.0 - evaluate(dist, s1, s2, len1, len2)
end
function compare(dist::Union{Hamming, Levenshtein, DamerauLevenshtein}, s1::AbstractString, s2::AbstractString,
len1::Integer, len2::Integer)
distance = evaluate(dist, s1, s2, len1, len2)
return len2 == 0 ? 1.0 : 1.0 - distance / len2
end
function compare(dist::QGram, s1::AbstractString, s2::AbstractString,
len1::Integer, len2::Integer)
distance = evaluate(dist, s1, s2, len1, len2)
if len1 <= (dist.q - 1)
return s1 == s2 ? 1.0 : 0.0
else
return 1 - distance / (len1 + len2 - 2 * dist.q + 2)
end
end

43
src/modifiers/partial.jl Normal file

@ -0,0 +1,43 @@
##############################################################################
##
## Partial
## From the Python module fuzzywuzzy
## http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
##
##############################################################################
type Partial{T <: PreMetric} <: PreMetric
dist::T
end
# general
function compare(dist::Partial, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
len1 == 0 && return compare(dist.dist, "", "", 0, 0)
iter = QGramIterator(s2, len2, len1)
state = start(iter)
s, state = next(iter, state)
out = compare(dist.dist, s1, s)
while !done(iter, state)
s, state = next(iter, state)
curr = compare(dist.dist, s1, s)
out = max(out, curr)
end
return out
end
# Specialization for RatcliffObershelp distance
# Code: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py
function compare(dist::Partial{RatcliffObershelp}, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
len1 == len2 && return compare(dist.dist, s1, s2, len1, len2)
out = 0.0
result = matching_blocks(s1, s2)
for r in result
s2_start = max(1, r[2] - r[1] + 1)
s2_end = s2_start + len1 - 1
i2_start = chr2ind(s2, s2_start)
i2_end = s2_end == len2 ? endof(s2) : (chr2ind(s2, s2_end + 1) - 1)
curr = compare(RatcliffObershelp(), s1, SubString(s2, i2_start, i2_end), len1, len1)
out = max(out, curr)
end
return out
end

61
src/modifiers/tokenize.jl Normal file

@ -0,0 +1,61 @@
##############################################################################
##
## TokenSort
##
##############################################################################
type TokenSort{T <: PreMetric} <: PreMetric
dist::T
end
function compare{T <: AbstractString}(dist::TokenSort, s1::T, s2::T, len1::Integer, len2::Integer)
s1 = join(sort!(split(s1)), " ")
s2 = join(sort!(split(s2)), " ")
compare(dist.dist, s1, s2)
end
##############################################################################
##
## TokenSet
##
##############################################################################
type TokenSet{T <: PreMetric} <: PreMetric
dist::T
end
function compare{T <: AbstractString}(dist::TokenSet, s1::T, s2::T, len1::Integer, len2::Integer)
v0, v1, v2 = _separate!(split(s1), split(s2))
s0 = join(v0, " ")
s1 = join(chain(v0, v1), " ")
s2 = join(chain(v0, v2), " ")
if isempty(s0)
# otherwise compare(dist, "", "a")== 1.0
compare(dist.dist, s1, s2)
else
max(compare(dist.dist, s0, s1),
compare(dist.dist, s1, s2),
compare(dist.dist, s0, s2))
end
end
# separate 2 vectors into intersection, setdiff1, setdiff2 (all sorted)
function _separate!(v1::Vector, v2::Vector)
sort!(v1)
sort!(v2)
out = eltype(v1)[]
start = 1
i1 = 0
while i1 < length(v1)
i1 += 1
x = v1[i1]
i2 = searchsortedfirst(v2, x, start, length(v2), Base.Forward)
i2 > length(v2) && break
if i2 > 0 && v2[i2] == x
deleteat!(v1, i1)
deleteat!(v2, i2)
push!(out, x)
i1 -= 1
start = i2
end
end
return out, v1, v2
end


@ -7,18 +7,18 @@
type Winkler{T1 <: PreMetric, T2 <: Real, T3 <: Real} <: PreMetric
dist::T1
scaling_factor::T2 # scaling factor. Default to 0.1
boosting_limit::T3 # boost threshold. Default to 1.0
boosting_limit::T3 # boost threshold. Default to 0.7
end
# restrict to distance between 0 and 1
Winkler(x) = Winkler(x, 0.1, 1.0)
Winkler(x) = Winkler(x, 0.1, 0.7)
function evaluate(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
distance = evaluate(Normalized(dist.dist), s1, s2, len1, len2)
function compare(dist::Winkler, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
score = compare(dist.dist, s1, s2, len1, len2)
l = common_prefix(s1, s2, 4)[1]
# common prefix adjustment
if distance <= dist.boosting_limit
distance -= distance * l * dist.scaling_factor
if score >= dist.boosting_limit
score += l * dist.scaling_factor * (1 - score)
end
return distance
return score
end


@ -1,30 +0,0 @@
##############################################################################
##
## Normalized
##
##############################################################################
type Normalized{T <: PreMetric} <: PreMetric
dist::T
end
function evaluate(normalized::Normalized, s1::AbstractString, s2::AbstractString, len1::Integer, len2::Integer)
evaluate(normalized.dist, s1, s2, len1, len2)
end
function evaluate{T <: Union{Hamming, Levenshtein, DamerauLevenshtein}}(
normalized::Normalized{T}, s1::AbstractString, s2::AbstractString,
len1::Integer, len2::Integer)
distance = evaluate(normalized.dist, s1, s2, len1, len2)
return distance / len2
end
function evaluate{T <: QGram}(normalized::Normalized{T}, s1::AbstractString, s2::AbstractString,
len1::Integer, len2::Integer)
distance = evaluate(normalized.dist, s1, s2, len1, len2)
if len1 <= (normalized.dist.q - 1)
return s1 == s2 ? 0.0 : 1.0
else
return distance / (len1 + len2 - 2 * normalized.dist.q + 2)
end
end


@ -2,15 +2,6 @@
using StringDistances, Base.Test
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "martha", "marhta") 1 - 0.9611 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dwayne", "duane") 1 - 0.84 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "dixon", "dicksonx") 1 - 0.81333 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "william", "williams") 1 - 0.975 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "", "foo") 1.0 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "a", "a") 0.0 1e-4
@test_approx_eq_eps evaluate(Winkler(Jaro(), 0.1, 1.0), "abc", "xyz") 1.0 1e-4
@test evaluate(Levenshtein(), "", "") == 0
@test evaluate(Levenshtein(), "abc", "") == 3
@test evaluate(Levenshtein(), "", "abc") == 3
@ -43,24 +34,10 @@ using StringDistances, Base.Test
@test evaluate(Hamming(), "saturday", "sunday") == 7
@test_approx_eq_eps evaluate(Normalized(Hamming()), "", "abc") 1.0 1e-4
@test_approx_eq_eps evaluate(Normalized(Hamming()), "acc", "abc") 1/3 1e-4
@test_approx_eq_eps evaluate(Normalized(Hamming()), "saturday", "sunday") 7/8 1e-4
@test evaluate(QGram(1), "", "abc") == 3
@test evaluate(QGram(1), "abc", "cba") == 0
@test evaluate(QGram(1), "abc", "ccc") == 4
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "", "abc") 1.0 1e-4
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "cba") 0.0 1e-4
@test_approx_eq_eps evaluate(Normalized(QGram(1)), "abc", "ccc") 2/3 1e-4
@test_approx_eq_eps evaluate(Cosine(2), "", "abc") 1 1e-4
@test_approx_eq_eps evaluate(Cosine(2), "abc", "ccc") 1 1e-4
@test_approx_eq_eps evaluate(Cosine(2), "leia", "leela") 0.7113249 1e-4
@ -71,17 +48,6 @@ using StringDistances, Base.Test
@test_approx_eq_eps evaluate(Jaccard(2), "leia", "leela") 0.83333 1e-4
strings = [
("martha", "marhta"),
("dwayne", "duane") ,
@ -106,7 +72,6 @@ strings = [
for x in ((Levenshtein(), [2 2 4 1 3 0 3 2 3 3 4 6 17 3 3 2]),
(DamerauLevenshtein(), [1 2 4 1 3 0 3 2 3 3 4 6 17 2 2 2]),
(Jaro(), [0.05555556 0.17777778 0.23333333 0.04166667 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.24722222 0.16190476 0.48809524 0.49166667 0.07407407 0.16666667 0.21666667]),
(Winkler(Jaro(), 0.1, 1.0), [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]),
(QGram(1), [0 3 3 1 3 0 6 4 5 4 4 11 14 0 0 3]),
(QGram(2), [ 6 7 7 1 2 0 4 4 7 8 4 13 32 8 6 5]),
(Jaccard(1), [0.0000000 0.4285714 0.3750000 0.1666667 1.0 0.0000000 1.0000000 0.6666667 0.5714286 0.3750000 0.2000000 0.8333333 0.5000000 0.0000000 0.0000000 0.2500000]),
@ -145,3 +110,28 @@ stringdist(strings[1,], strings[2,], method = "jw", p = 0.1)
stringdist(strings[1,], strings[2,], method = "qgram", q = 1)
=#
@test matching_blocks("martha", "marhta") ==
Set([(1,1,3)
(4,5,1)
(6,6,1)
])
@test matching_blocks("dwayne", "duane") ==
Set([(5,4,2)
(1,1,1)
(3,3,1)])
@test matching_blocks("dixon", "dicksonx") ==
Set([(1,1,2)
(4,6,2)
])
@test_approx_eq evaluate(RatcliffObershelp(), "dixon", "dicksonx") 1 - 0.6153846153846154
@test_approx_eq evaluate(RatcliffObershelp(), "alexandre", "aleksander") 1 - 0.7368421052631579
@test_approx_eq evaluate(RatcliffObershelp(), "pennsylvania", "pencilvaneya") 1 - 0.6666666666666
@test_approx_eq evaluate(RatcliffObershelp(), "", "pencilvaneya") 1.0
@test_approx_eq evaluate(RatcliffObershelp(),"NEW YORK METS", "NEW YORK MEATS") 1 - 0.962962962963
@test_approx_eq evaluate(RatcliffObershelp(), "Yankees", "New York Yankees") 0.3913043478260869
@test_approx_eq evaluate(RatcliffObershelp(), "New York Mets", "New York Yankees") 0.24137931034482762

62
test/modifiers.jl Normal file

@ -0,0 +1,62 @@
using StringDistances, Base.Test
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "martha", "marhta") 0.9611 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dwayne", "duane") 0.84 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "dixon", "dicksonx") 0.81333 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "william", "williams") 0.975 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "", "foo") 0.0 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "a", "a") 1.0 1e-4
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), "abc", "xyz") 0.0 1e-4
strings = [
("martha", "marhta"),
("dwayne", "duane") ,
("dixon", "dicksonx"),
("william", "williams"),
("", "foo"),
("a", "a"),
("abc", "xyz"),
("abc", "ccc"),
("kitten", "sitting"),
("saturday", "sunday"),
("hi, my name is", "my name is"),
("alborgów", "amoniak"),
("cape sand recycling ", "edith ann graham"),
( "jellyifhs", "jellyfish"),
("ifhs", "fish"),
("leia", "leela"),
]
solutions = [0.03888889 0.16000000 0.18666667 0.02500000 1.00000000 0.00000000 1.00000000 0.44444444 0.25396825 0.22250000 0.16190476 0.43928571 0.49166667 0.04444444 0.16666667 0.17333333]
for i in 1:length(solutions)
@test_approx_eq_eps compare(Winkler(Jaro(), 0.1, 0.0), strings[i]...) (1 - solutions[i]) 1e-4
end
@test_approx_eq_eps compare(Hamming(), "", "abc") 0.0 1e-4
@test_approx_eq_eps compare(Hamming(), "acc", "abc") 2/3 1e-4
@test_approx_eq_eps compare(Hamming(), "saturday", "sunday") 1/8 1e-4
@test_approx_eq_eps compare(QGram(1), "", "abc") 0.0 1e-4
@test_approx_eq_eps compare(QGram(1), "abc", "cba") 1.0 1e-4
@test_approx_eq_eps compare(QGram(1), "abc", "ccc") 1/3 1e-4
@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees", "Yankees") 1.0
@test_approx_eq compare(Partial(RatcliffObershelp()), "New York Yankees", "") 0.0
@test_approx_eq compare(Partial(Hamming()), "New York Yankees", "Yankees") 1
@test_approx_eq compare(Partial(Hamming()), "New York Yankees", "") 1
@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") 1.0
@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "los angeles angels of anaheim at seattle mariners") 1.0 - 0.09090909090909094
@test_approx_eq compare(TokenSort(RatcliffObershelp()), "New York Mets vs Atlanta Braves", "") 0.0
@test_approx_eq compare(TokenSet(RatcliffObershelp()),"mariners vs angels", "") 0.0


@ -1,7 +1,6 @@
using StringDistances
tests = ["distances.jl"
]
tests = ["distances.jl", "modifiers.jl"]
println("Running tests:")