编辑距离算法
编辑距离算法(Levenshtein Distance)
编辑距离算法,又叫字符串相似度算法。
算法介绍
最近刷CodeWars时遇到了一道题,题目表述如下:
I'm sure, you know Google's "Did you mean ...?", when you entered a search term and mistyped a word. In this kata we want to implement something similar.
You'll get an entered term (lowercase string) and an array of known words (also lowercase strings). Your task is to find out, which word from the dictionary is most similar to the entered one. The similarity is described by the minimum number of letters you have to add, remove or replace in order to get from the entered word to one of the dictionary. The lower the number of required changes, the higher the similarity between each two words.
Same words are obviously the most similar ones. A word that needs one letter to be changed is more similar to another word that needs 2 (or more) letters to be changed. E.g. the mistyped term berr is more similar to beer (1 letter to be replaced) than to barrel (3 letters to be changed in total).
Extend the dictionary in a way, that it is able to return you the most similar word from the list of known words.
Code Examples:
Dictionary fruits = new Dictionary(new String[]{"cherry", "pineapple", "melon", "strawberry", "raspberry"});
fruits.findMostSimilar("strawbery"); // must return "strawberry"
fruits.findMostSimilar("berry"); // must return "cherry"
Dictionary things = new Dictionary(new String[]{"stars", "mars", "wars", "codec", "codewars"});
things.findMostSimilar("coddwars"); // must return "codewars"
Dictionary languages = new Dictionary(new String[]{"javascript", "java", "ruby", "php", "python", "coffeescript"});
languages.findMostSimilar("heaven"); // must return "java"
languages.findMostSimilar("javascript"); // must return "javascript" (same words are obviously the most similar ones)
显然题目中最重要的算法便是如何计算出输入的字符串与字典中每个字符串的相似度。这里就引出了编辑距离算法这个概念。
算法过程
- define input, target, input.length = m, target.length = n; 分别为输入,转换后的字符串,input的长度,target的长度
- if m == 0 return n; if n == 0 return m;
- 初始化m * n 的矩阵matrix[m][n](因为最终结果矩阵是一个对称矩阵,因此初始化n * m矩阵结果相同),对于第一行和第一列,matrix[0][j] = j ; matrix[i][0] = i;
- 对于matrix[i][j] (i != 0 && j != 0) ,t = input[i] == target[j] ? 0 : 1;
- matrix[i][j]取以下三者的最小值:matrix[i - 1][j] +1 ; matrix[i][j - 1] + 1 ; matrix[i - 1][j - 1] + t ;
- return matrix[m - 1][n - 1] ; (右下角即为结果)
通过算法过程伪码我们不难发现,算法其实是一种动态规划。因此理论上使用暴力算法一遍一遍地扫描字符串也可以解出,但逻辑过于复杂,动态规划思想很好地解决了这个问题。
代码实现
public int getEditDistance(String input, String target){
int m = input.length();
int n = target.length();
if(m == 0) return n;
if(n == 0) return m;
int[][] matrix = new int[m][n];
for(int i = 0; i < m; i++){
matrix[i][0] = i;
}
for(int j = 0; j < n; j++){
matrix[0][j] = j;
}
for(int i = 1; i < m; i++)
for(int j = 1; j < n; j++){
int t = input.charAt(i) == target.charAt(j) ? 0 : 1;
matrix[i][j] = Math.min(matrix[i - 1][j - 1] + t,
Math.min(matrix[i - 1][j] + 1,matrix[i][j - 1] + 1));
}
return matrix[m - 1][n - 1];
}