如何计算给定2个字符串的距离相似性度量



我需要计算两个字符串之间的相似性。那么我到底是什么意思呢?让我举一个例子来解释:

  • 真正的单词:hospital
  • 错字:haspita

现在我的目标是确定我需要修改多少个字符才能获得真正的单词。在这个例子中,我需要修改2个字母。那么百分比是多少呢?我总是用真字的长度。所以它变成2/8=25%,所以这2个给定的字符串DSM是75%。

在性能是关键考虑因素的情况下,我如何实现这一点?

我几周前刚刚解决了这个完全相同的问题。既然现在有人在问,我就分享代码。在我详尽的测试中,即使没有提供最大距离,我的代码也比维基百科上的C#示例快10倍。当提供最大距离时,此性能增益将增加到30x-100x+。注意性能的几个关键点:

  • 如果需要反复比较相同的单词,首先将这些单词转换为整数数组。Damerau-Levenstein算法包括许多>、<、==比较和ints的比较比chars快得多
  • 它包括一个短路机制,当距离超过规定的最大值时退出
  • 使用一个由三个数组组成的旋转集合,而不是像我在其他地方看到的所有实现中那样使用一个巨大的矩阵
  • 确保你的数组在较短的字宽度上切片

代码(如果在参数声明中将int[]替换为String,则其工作原理完全相同:

/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
    int length1 = source.Length;
    int length2 = target.Length;
    // Return trivial case - difference in string lengths exceeds threshhold
    if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
    // Ensure arrays [i] / length1 use shorter length 
    if (length1 > length2) {
        Swap(ref target, ref source);
        Swap(ref length1, ref length2);
    }
    int maxi = length1;
    int maxj = length2;
    int[] dCurrent = new int[maxi + 1];
    int[] dMinus1 = new int[maxi + 1];
    int[] dMinus2 = new int[maxi + 1];
    int[] dSwap;
    for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
    int jm1 = 0, im1 = 0, im2 = -1;
    for (int j = 1; j <= maxj; j++) {
        // Rotate
        dSwap = dMinus2;
        dMinus2 = dMinus1;
        dMinus1 = dCurrent;
        dCurrent = dSwap;
        // Initialize
        int minDistance = int.MaxValue;
        dCurrent[0] = j;
        im1 = 0;
        im2 = -1;
        for (int i = 1; i <= maxi; i++) {
            int cost = source[im1] == target[jm1] ? 0 : 1;
            int del = dCurrent[im1] + 1;
            int ins = dMinus1[i] + 1;
            int sub = dMinus1[im1] + cost;
            //Fastest execution for min value of 3 integers
            int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
            if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                min = Math.Min(min, dMinus2[im2] + cost);
            dCurrent[i] = min;
            if (min < minDistance) { minDistance = min; }
            im1++;
            im2++;
        }
        jm1++;
        if (minDistance > threshold) { return int.MaxValue; }
    }
    int result = dCurrent[maxi];
    return (result > threshold) ? int.MaxValue : result;
}

其中Swap为:

static void Swap<T>(ref T arg1,ref T arg2) {
    T temp = arg1;
    arg1 = arg2;
    arg2 = temp;
}

您要查找的是编辑距离Levenstein距离。维基百科的文章解释了它是如何计算的,并且在底部有一段很好的伪代码,可以帮助你很容易地用C#编写这种算法。

以下是链接到以下第一个站点的实现:

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);
    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }

可以使用大量的字符串相似性距离算法。此处列出的一些(但未详尽列出):

  • Levenstein
  • Needleman Wunch
  • Smith Waterman
  • Smith Waterman Gotoh
  • Jaro,Jaro Winkler
  • Jaccard相似性
  • 欧几里得距离
  • 骰子相似性
  • 余弦相似性
  • Monge Elkan

包含所有这些实现的库称为SimMetrics它同时具有java和c#实现。

我发现Levenstein和Jaro Winkler非常适合字符串之间的小差异,例如:

  • 拼写错误;或
  • ö而不是人名中的o

然而,当比较类似文章标题的东西时,其中文本的重要部分将是相同的;噪声";在边缘,Smith Waterman Gotoh表现出色:

比较这两个标题(相同但不同来源的措辞不同):

一种来自大肠杆菌的核酸内切酶,在紫外线照射的DNA中引入单个多核苷酸链断裂

内切酶III:一种来自大肠杆菌的内切酶,在紫外线照射的DNA中引发单核苷酸链断裂

这个提供字符串算法比较的网站显示:

  • Levenstein:81
  • Smith Waterman Gotoh94
  • Jaro Winkler 78

Jaro Winkler和Levenstein在检测相似性方面不如Smith Waterman Gotoh。如果我们比较两个标题,它们不是同一篇文章,但有一些匹配的文本:

高等植物的脂肪代谢酰基硫酯酶在酰基辅酶A和酰酰基载体蛋白s 代谢中的作用

高等植物的脂肪代谢复杂脂质混合物中酰基载体蛋白酰辅酶A的测定

Jaro Winkler给出了假阳性,但Smith Waterman Gotoh没有:

  • Levenstein:54
  • Smith Waterman Gotoh49
  • Jaro Winkler 89

正如Anastasiosyal所指出的,SimMetrics拥有这些算法的java代码。我成功地使用了SimMetrics的SmithWatermanGooh java代码。

这是我对Damerau-Levenstein Distance的实现,它不仅返回相似系数,还返回更正单词中的错误位置(此功能可用于文本编辑器)。此外,我的实现支持不同权重的错误(替换、删除、插入、换位)。

public static List<Mistake> OptimalStringAlignmentDistance(
  string word, string correctedWord,
  bool transposition = true,
  int substitutionCost = 1,
  int insertionCost = 1,
  int deletionCost = 1,
  int transpositionCost = 1)
{
    int w_length = word.Length;
    int cw_length = correctedWord.Length;
    var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
    var result = new List<Mistake>(Math.Max(w_length, cw_length));
    if (w_length == 0)
    {
        for (int i = 0; i < cw_length; i++)
            result.Add(new Mistake(i, CharMistakeType.Insertion));
        return result;
    }
    for (int i = 0; i <= w_length; i++)
        d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
    for (int j = 0; j <= cw_length; j++)
        d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
    for (int i = 1; i <= w_length; i++)
    {
        for (int j = 1; j <= cw_length; j++)
        {
            bool equal = correctedWord[j - 1] == word[i - 1];
            int delCost = d[i - 1, j].Key + deletionCost;
            int insCost = d[i, j - 1].Key + insertionCost;
            int subCost = d[i - 1, j - 1].Key;
            if (!equal)
                subCost += substitutionCost;
            int transCost = int.MaxValue;
            if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
            {
                transCost = d[i - 2, j - 2].Key;
                if (!equal)
                    transCost += transpositionCost;
            }
            int min = delCost;
            CharMistakeType mistakeType = CharMistakeType.Deletion;
            if (insCost < min)
            {
                min = insCost;
                mistakeType = CharMistakeType.Insertion;
            }
            if (subCost < min)
            {
                min = subCost;
                mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
            }
            if (transCost < min)
            {
                min = transCost;
                mistakeType = CharMistakeType.Transposition;
            }
            d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
        }
    }
    int w_ind = w_length;
    int cw_ind = cw_length;
    while (w_ind >= 0 && cw_ind >= 0)
    {
        switch (d[w_ind, cw_ind].Value)
        {
            case CharMistakeType.None:
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Substitution:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Deletion:
                result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                w_ind--;
                break;
            case CharMistakeType.Insertion:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                cw_ind--;
                break;
            case CharMistakeType.Transposition:
                result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                w_ind -= 2;
                cw_ind -= 2;
                break;
        }
    }
    if (d[w_length, cw_length].Key > result.Count)
    {
        int delMistakesCount = d[w_length, cw_length].Key - result.Count;
        for (int i = 0; i < delMistakesCount; i++)
            result.Add(new Mistake(0, CharMistakeType.Deletion));
    }
    result.Reverse();
    return result;
}
public struct Mistake
{
    public int Position;
    public CharMistakeType Type;
    public Mistake(int position, CharMistakeType type)
    {
        Position = position;
        Type = type;
    }
    public override string ToString()
    {
        return Position + ", " + Type;
    }
}
public enum CharMistakeType
{
    None,
    Substitution,
    Insertion,
    Deletion,
    Transposition
}

这段代码是我项目的一部分:Yandex-Linguistics.NET.

我写了一些测试,在我看来这种方法是有效的。

但欢迎发表评论和意见。

这里有一种替代方法:

寻找相似性的一个典型方法是Levenstein距离,毫无疑问,库中有可用的代码。

不幸的是,这需要对每个字符串进行比较。您可能可以编写一个专门版本的代码来缩短计算。如果距离大于某个阈值,您仍然需要进行所有比较。

另一个想法是使用一些三角形或n形图的变体。这些是n个字符的序列(或n个单词或n个基因组序列或n个其他)。保持三角图到字符串的映射,并选择重叠最大的。n的典型选择是"n";3〃;,因此得名。

例如,英语会有以下三元图:

  • Eng
  • ngl
  • gli
  • lis
  • ish

英格兰会有:

  • Eng
  • ngl
  • gla
  • 局域网
  • 以及

好吧,7中2(或10中4)匹配。如果这对你有用,你可以索引trigram/string表并获得更快的搜索。

你也可以将其与Levenstein结合起来,将比较集减少到那些有一些最小数量的n-gram共同点的比较集。

下面是一个VB.net实现:

Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
    Dim cost(v1.Length, v2.Length) As Integer
    If v1.Length = 0 Then
        Return v2.Length                'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
    ElseIf v2.Length = 0 Then
        Return v1.Length                'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
    Else
        'setup the base costs for inserting the correct characters
        For v1Count As Integer = 0 To v1.Length
            cost(v1Count, 0) = v1Count
        Next v1Count
        For v2Count As Integer = 0 To v2.Length
            cost(0, v2Count) = v2Count
        Next v2Count
        'now work out the cheapest route to having the correct characters
        For v1Count As Integer = 1 To v1.Length
            For v2Count As Integer = 1 To v2.Length
                'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
                'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), 
                'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and 
                cost(v1Count, v2Count) = Math.Min(
                    cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
                    Math.Min(
                        cost(v1Count - 1, v2Count) + 1,
                        cost(v1Count, v2Count - 1) + 1
                    )
                )
            Next v2Count
        Next v1Count
        'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
        'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
        Return cost(v1.Length, v2.Length)
    End If
End Function

最新更新