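Computing String Similarity
Today I was working on fuzzy string matching and needed an algorithm for computing string similarity, so I took some time to read up on the topic.
Introductions to string similarity algorithms: http://blog.csdn.net/dongle2001/archive/2007/01/02/1472235.aspx
http://blog.csdn.net/tanhua103292/archive/2009/08/06/4419494.aspx
The Levenshtein Distance algorithm: Wikipedia explains it fairly clearly at http://en.wikipedia.org/wiki/Levenshtein_distance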
The Algorithm
Steps
| Step | Description |
|---|---|
| 1 | Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix d containing rows 0..n and columns 0..m, so that d[i,j] refers to the first i characters of s and the first j characters of t. |
| 2 | Initialize the first row to 0..m. Initialize the first column to 0..n. |
| 3 | Examine each character of s (i from 1 to n). |
| 4 | Examine each character of t (j from 1 to m). |
| 5 | If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1. |
| 6 | Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. |
| 7 | After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]. |
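The seven steps above can be sketched in C++ roughly as follows. This is a minimal sketch rather than a reference implementation: it assumes `std::string` input and a `std::vector` matrix, the function name `levenshteinBySteps` is hypothetical, and `d[i][j]` is indexed by position i in s and position j in t, matching steps 6 and 7.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper that follows steps 1-7 of the table.
// d[i][j] is the distance between the first i chars of s and the first j chars of t.
int levenshteinBySteps(const std::string& s, const std::string& t) {
    // Step 1: lengths and trivial cases.
    const std::size_t n = s.size();
    const std::size_t m = t.size();
    if (n == 0) return static_cast<int>(m);
    if (m == 0) return static_cast<int>(n);

    // Step 2: first column holds 0..n, first row holds 0..m.
    std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= n; ++i) {        // Step 3: each character of s
        for (std::size_t j = 1; j <= m; ++j) {    // Step 4: each character of t
            // Step 5: cost is 0 on a match, 1 otherwise (strings are 0-based here).
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            // Step 6: minimum of the cell above, the cell to the left, and the diagonal.
            d[i][j] = std::min({d[i - 1][j] + 1,
                                d[i][j - 1] + 1,
                                d[i - 1][j - 1] + cost});
        }
    }
    return d[n][m];  // Step 7
}
```

For example, `levenshteinBySteps("kitten", "sitting")` evaluates to 3: two substitutions (k to s, e to i) and one insertion (g).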
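Pseudocode (from Wikipedia). Note that here m is the length of s and n is the length of t, the reverse of the naming in the step table: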
int LevenshteinDistance(char s[1..m], char t[1..n])
{
    // for all i and j, d[i,j] will hold the Levenshtein distance between
    // the first i characters of s and the first j characters of t;
    // note that d has (m+1)x(n+1) values
    declare int d[0..m, 0..n]

    for i from 0 to m
        d[i, 0] := i // the distance of any first string to an empty second string
    for j from 0 to n
        d[0, j] := j // the distance of any second string to an empty first string

    for j from 1 to n
    {
        for i from 1 to m
        {
            if s[i] = t[j] then
                d[i, j] := d[i-1, j-1] // no operation required
            else
                d[i, j] := minimum
                (
                    d[i-1, j] + 1,  // a deletion
                    d[i, j-1] + 1,  // an insertion
                    d[i-1, j-1] + 1 // a substitution
                )
        }
    }

    return d[m,n]
}
Analysis of the Levenshtein Distance algorithm:
Proof of correctness
As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:
- It is initially true on row and column 0 because s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters.
- If s[i] = t[j], and we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i] and just leave the last character alone, giving k operations.
- Otherwise, the distance is the minimum of the three possible ways to do the transformation:
  - If we can transform s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations (insertion).
  - If we can transform s[1..i-1] to t[1..j] in k operations, then we can remove s[i] and then do the same transformation, for a total of k+1 operations (deletion).
  - If we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i], and exchange the original s[i] for t[j] afterwards, for a total of k+1 operations (substitution).
- The number of operations required to transform s[1..n] into t[1..m] is of course the number required to transform all of s into all of t, and so d[n,m] holds our result.
This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.
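The case analysis in the proof maps directly onto a recursive definition, which can serve as a cross-check for a table-based implementation on small inputs. The sketch below is only an illustration: the name `levRecursive` is hypothetical, and the running time is exponential, so it is not suitable for real matching work.

```cpp
#include <algorithm>
#include <string>

// Hypothetical reference function: distance between the first i characters of s
// and the first j characters of t, written to mirror the cases of the proof.
int levRecursive(const std::string& s, std::size_t i,
                 const std::string& t, std::size_t j) {
    if (i == 0) return static_cast<int>(j);  // add all j characters
    if (j == 0) return static_cast<int>(i);  // drop all i characters
    if (s[i - 1] == t[j - 1]) {
        // Last characters match: leave them alone.
        return levRecursive(s, i - 1, t, j - 1);
    }
    // Otherwise take the cheapest of the three transformations, plus one operation.
    return 1 + std::min({levRecursive(s, i, t, j - 1),        // insertion
                         levRecursive(s, i - 1, t, j),        // deletion
                         levRecursive(s, i - 1, t, j - 1)});  // substitution
}
```

Calling `levRecursive(s, s.size(), t, t.size())` should agree with d[n,m] from the matrix version for any pair of short strings.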
Implementation (C++):
#include <cstring> // strlen, NULL

// Fully parenthesized so the macro is safe inside larger expressions
#define MIN(a,b) ((a) < (b) ? (a) : (b))

int LevenshteinDistance(const char *source, const char *target)
{
    // step 1: a NULL pointer is treated as an empty string
    int n, m;
    if (source == NULL) n = 0;
    else n = (int)strlen(source);
    if (target == NULL) m = 0;
    else m = (int)strlen(target);
    if (m == 0) return n;
    if (n == 0) return m;

    // Construct an (n+1) x (m+1) matrix
    int **matrix = new int*[n+1];
    for (int i = 0; i <= n; ++i)
        matrix[i] = new int[m+1];

    // step 2: initialize the first column to 0..n and the first row to 0..m
    for (int i = 0; i <= n; ++i) matrix[i][0] = i;
    for (int j = 0; j <= m; ++j) matrix[0][j] = j;

    // step 3: examine each character of source
    char si, tj;
    for (int i = 1; i <= n; ++i) {
        si = source[i-1];
        // step 4: examine each character of target
        for (int j = 1; j <= m; ++j) {
            tj = target[j-1]; // index by j, not i
            // steps 5 and 6
            if (si == tj)
                matrix[i][j] = matrix[i-1][j-1];
            else
                matrix[i][j] = MIN(matrix[i-1][j] + 1,
                                   MIN(matrix[i][j-1] + 1, matrix[i-1][j-1] + 1));
        }
    }

    // step 7: read the result, then free the matrix
    int result = matrix[n][m];
    for (int i = 0; i <= n; ++i)
        delete[] matrix[i];
    delete[] matrix;
    return result;
}
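Since the original goal was fuzzy matching, the raw distance is usually turned into a score before comparing candidates. The sketch below is one possible way to do that, under stated assumptions: the normalization 1 - distance / max(length) is a common convention rather than something the sources above prescribe, the names `editDistanceTwoRows` and `similarity` are hypothetical, and the two-row layout is a memory-saving variant swapped in for the full matrix.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical variant that keeps only the previous and current rows of the matrix.
int editDistanceTwoRows(const std::string& s, const std::string& t) {
    std::vector<int> prev(t.size() + 1), curr(t.size() + 1);
    // Row 0: distance from the empty prefix of s to each prefix of t.
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = static_cast<int>(i);  // distance from s[0..i) to the empty string
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, curr);  // the finished row becomes the previous row
    }
    return prev[t.size()];
}

// Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
double similarity(const std::string& s, const std::string& t) {
    const std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are treated as identical
    return 1.0 - static_cast<double>(editDistanceTwoRows(s, t)) /
                 static_cast<double>(longest);
}
```

With this convention, `similarity("abcd", "abcf")` is 0.75, because one substitution is needed across four characters. A matching threshold would have to be chosen for the data at hand.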
