[笔记]Delphi实现获取字符串相似度

维基百科对字符串相似度(Damerau–Levenshtein distance)的定义是:

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or atransposition of two adjacent characters. In his seminal paper[1], Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance, i.e., dealing with multiple edit operations, known as the Levenshtein distance, was introduced by Levenshtein,[2] but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distanceis sometimes used in some sources as to take into account non-adjacent transpositions or not.

简单翻译下,两个字符序列的DL距离,就是从一个变换到另一个的最小操作次数。这个变换包括插入一个字符删除一个字符替换一个字符、或互换两个相邻字符

而所谓“编辑距离(edit distance,或叫Levenshtein distance)”,并不包含互换两个相邻字符

主要应用是在字符拼写检查上,当然也可以用在其他地方,比方不少输入法就提供类似的校正功能(搜狗拼音输入法即实现了此功能)。

看起来简单,实现还是有一定困难的,好在有牛人已经做好相应的函数,如 KambizHow to match two strings approximately 中提供了两个函数:

计算DL距离的函数DamerauLevenshteinDistance(Str1, Str2)

function DamerauLevenshteinDistance(const Str1, Str2: string): Integer;
var
  LenStr1, LenStr2: Integer;
  I, J, T, Cost, Minimum: Integer;
  pStr1, pStr2, S1, S2: PChar;
  D, RowPrv2, RowPrv1, RowCur, Temp: PIntegerArray;
begin
  LenStr1 := Length(Str1);
  LenStr2 := Length(Str2);

  // to save some space, make sure the second index points to the shorter string
  if LenStr1 < LenStr2 then begin
    T := LenStr1;
    LenStr1 := LenStr2;
    LenStr2 := T;
    pStr1 := PChar(Str2);
    pStr2 := PChar(Str1);
  end
  else begin
    pStr1 := PChar(Str1);
    pStr2 := PChar(Str2);
  end;

  // to save some time and space, look for exact match
  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin
    Inc(pStr1);
    Inc(pStr2);
    Dec(LenStr1);
    Dec(LenStr2);
  end;

  // when one string is empty, length of the other is the distance
  if LenStr2 = 0 then begin
    Result := LenStr1;
    Exit;
  end;

  // calculate the edit distance
  T := LenStr2 + 1;
  GetMem(D, 3 * T * SizeOf(Integer));
  FillChar(D^, 2 * T * SizeOf(Integer), 0);
  RowCur := D;
  RowPrv1 := @D[T];
  RowPrv2 := @D[2 * T];
  S1 := pStr1;

  for I := 1 to LenStr1 do begin
    Temp := RowPrv2;
    RowPrv2 := RowPrv1;
    RowPrv1 := RowCur;
    RowCur := Temp;
    RowCur[0] := I;
    S2 := pStr2;

    for J := 1 to LenStr2 do begin
      Cost := Ord(S1^ <> S2^);
      Minimum := RowPrv1[J - 1] + Cost;                 // substitution
      T := RowCur[J - 1] + 1;                           // insertion

      if T < Minimum then Minimum := T;

      T := RowPrv1[J] + 1;                              // deletion

      if T < Minimum then Minimum := T;

      if (I <> 1) and (J <> 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^) then begin
        T := RowPrv2[J - 2] + Cost;                     // transposition

        if T < Minimum then Minimum := T;
      end;

      RowCur[J] := Minimum;
      Inc(S2);
    end;

    Inc(S1);
  end;

  Result := RowCur[LenStr2];
  FreeMem(D);
end;

还有计算字符串相似度的函数 StringSimilarityRatio(Str1, Str2, IgnoreCase): Double;

返回值在0到1之间,0表示不相似,1表示完全相似。

function StringSimilarityRatio(const Str1, Str2: string; IgnoreCase: Boolean): Double;
var
  MaxLen: Integer;
  Distance: Integer;
begin
  Result := 1.0;

  if Length(Str1) > Length(Str2) then
    MaxLen := Length(Str1)
  else
    MaxLen := Length(Str2);

  if MaxLen <> 0 then begin
    if IgnoreCase then
      Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))
    else
      Distance := DamerauLevenshteinDistance(Str1, Str2);

    Result := Result - (Distance / MaxLen);
  end;
end;

后来data man 参考一个德国人的ApproxStrUtils单元(该单元计算的是L距离,不是DL距离)给出一个据说效率更高的DL距离函数遗憾的是调用它会有“Invalid Pointer Operation”的报错,还没有Debug出问题所在,所以暂时先用前一个版本吧。

function DamerauLevenshteinDistance2(const Str1, Str2: string): Integer;
  function Min(const A, B, C: Integer): Integer; inline;
  begin
    Result := A;
    if B < A then
      Result := B;
    if C < Result then
      Result := C;
  end;

var
  LenStr1, LenStr2: Integer;
  I, J, T, Cost, PrevCost: Integer;
  pStr1, pStr2, S1, S2: PChar;
  D: PIntegerArray;
begin
  LenStr1 := Length(Str1);
  LenStr2 := Length(Str2);

  // to save some space, make sure the second index points to the shorter string
  if LenStr1 < LenStr2 then begin
    T := LenStr1;
    LenStr1 := LenStr2;
    LenStr2 := T;
    pStr1 := PChar(Str2);
    pStr2 := PChar(Str1);
  end
  else begin
    pStr1 := PChar(Str1);
    pStr2 := PChar(Str2);
  end;

  // to save some time and space, look for exact match
  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin
    Inc(pStr1);
    Inc(pStr2);
    Dec(LenStr1);
    Dec(LenStr2);
  end;

  while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin
    Dec(LenStr1);
    Dec(LenStr2);
  end;

  if LenStr2 = 0 then begin
    Result := LenStr1;
    Exit;
  end;

  // calculate the edit distance
  T := LenStr2 + 1;
  GetMem(D, T * SizeOf(Integer));

  for I := 0 to T do D[I] := I;

  S1 := pStr1;
  for I := 1 to LenStr1 do begin
    PrevCost := I - 1;
    Cost := I;
    S2 := pStr2;

    for J := 1 to LenStr2 do begin
      if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then
        Cost := PrevCost
      else
        Cost := 1 + min(Cost, PrevCost, D[J]);

      PrevCost := D[J];
      D[J] := Cost;
      Inc(S2);
    end;

    Inc(S1);
  end;

  Result := D[LenStr2];
  FreeMem(D);
end;

参考文献:

  1. Damerau–Levenshtein_distance
    http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
  2. How to match two strings approximately
    http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
  3. Fuzzy string matching
    www.delphiarea.com/articles/how-to-match-two-strings-approximately
  4. Fuzzy search in strings
    http://www.gausi.de/approxstrutils-en.html
posted @ 2011-07-21 14:11  ET民工[源自火星]  阅读(1568)  评论(0编辑  收藏  举报