There are a number of advanced variants of minimum edit distance that play a special role in computational biology.

String x : AGA

String y : AG

Needleman-Wunsch

In computational biology, what we care about is similarity rather than distance. So we modify the previous algorithm, the minimum edit distance algorithm, to maximize a similarity score instead of minimizing a cost.

Initialization:

D(i, 0) = -i * d, which means the first i characters of x are aligned against an empty prefix of y, so each of those i characters must be paired with a gap, and each gap costs d.

        note that costs are now written as negative scores

D(0, j) = -j * d

d : the cost of inserting or deleting one character (the gap penalty)

s(x[i], y[j]) : when x[i] is equal to y[j], the result is a, otherwise it's b. The values of a and b depend on the problem at hand.

 

Recurrence Relation:

D(i, j) = max{  D(i - 1, j) - d;    x[i] is aligned with a gap, which costs d

        D(i, j - 1) - d;    y[j] is aligned with a gap, which costs d

        D(i - 1, j - 1) + s(x[i], y[j]) }    x[i] is aligned with y[j]

 

Termination:

D(N, M) is the optimal global alignment score, which is what we want. Here N and M are the lengths of x and y.
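The three steps above (Initialization, Recurrence Relation, Termination) can be sketched in Python as follows. This is only a minimal sketch, not a reference implementation: the function name `needleman_wunsch_matrix` is made up for illustration, and d is assumed to be a positive penalty so that it agrees with D(i, 0) = -i * d.

```python
def needleman_wunsch_matrix(x, y, d, a, b):
    """Fill the global-alignment matrix D (hypothetical helper name).

    d is the gap penalty (assumed positive), a is the score for a match
    and b is the score for a mismatch.
    """
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: a prefix aligned against the empty string is all gaps.
    for i in range(1, n + 1):
        D[i][0] = -i * d
    for j in range(1, m + 1):
        D[0][j] = -j * d

    # Recurrence: gap in y, gap in x, or align x[i] with y[j].
    # Python strings are 0-indexed, so x[i] in the text is x[i - 1] here.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = a if x[i - 1] == y[j - 1] else b
            D[i][j] = max(D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s)
    return D
```

The value from the Termination step is then the bottom-right cell, `D[len(x)][len(y)]`.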

For example, we can set

d = 2 means the cost of an insertion or a deletion (d is positive, because the formulas subtract it)

a = 1 means when x[i] is equal to y[j], then s(x[i], y[j]) = 1

b = -1 means when x[i] isn't equal to y[j], then s(x[i], y[j]) = -1

So, according to the formula, you can fill in the matrix cell by cell. Finally, you can use a backtrace to recover the alignment path.

 

 Consequently, the result is D(3, 2) = 0 for x = AGA and y = AG.
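To check this result and see the backtrace in action, here is a small Python sketch. The name `global_align` and the tie-breaking order (diagonal first, then up, then left) are my own assumptions; other orders can give a different alignment with the same score. The defaults assume a positive gap penalty d = 2, with a = 1 and b = -1.

```python
def global_align(x, y, d=2, a=1, b=-1):
    """Return (score, aligned_x, aligned_y) for a global alignment.

    Hypothetical helper; d is a positive gap penalty.
    """
    def s(p, q):
        return a if p == q else b

    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = -i * d
    for j in range(1, m + 1):
        D[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))

    # Backtrace from D(N, M) to D(0, 0), re-checking which move
    # produced each cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")   # gap in y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])   # gap in x
            j -= 1
    return D[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(global_align("AGA", "AG"))  # (0, 'AGA', 'AG-')
```

The two matches give 1 + 1 and the single gap costs 2, so the total is 0.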

 

 

Smith-Waterman

Sometimes we have no need to compare the two strings in their entirety; what we want is just the best local alignment, that is, the pair of substrings with the maximum similarity score.

So we should modify the algorithm so that it doesn't penalize gaps at the beginning.

Initialization:

D(i, 0) = 0, because it doesn't cost us anything to start the match anywhere inside a long string

D(0, j) = 0

d : the cost of inserting or deleting one character (the gap penalty)

s(x[i], y[j]) : when x[i] is equal to y[j], the result is a, otherwise it's b. The values of a and b depend on the problem at hand.

 

Iteration:

D(i, j) = max{   0;

        D(i - 1, j) - d;

        D(i, j - 1) - d;

        D(i - 1, j - 1) + s(x[i], y[j])}

 

Termination:

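Fill the matrix up to D(N, M), then find the maximum value in the whole matrix. That maximum is the best local alignment score, and the cell that holds it is where the local alignment ends.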
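The Initialization, Iteration and Termination steps for the local version can be sketched like this in Python. Again this is just a sketch under assumptions: the name `smith_waterman_best` is made up, d is taken as a positive penalty, and when several cells share the maximum the first one found in row order is kept.

```python
def smith_waterman_best(x, y, d, a, b):
    """Return (best_score, (i, j)) for a local alignment.

    Hypothetical helper; (i, j) is the cell where the best local
    alignment ends, using the 1-based indices of the text.
    """
    n, m = len(x), len(y)
    # Initialization: the first row and column are all 0.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = a if x[i - 1] == y[j - 1] else b
            # The extra 0 lets a new local alignment start at any cell.
            D[i][j] = max(0,
                          D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s)
            # Termination: track the maximum over the whole matrix.
            if D[i][j] > best:
                best, best_cell = D[i][j], (i, j)
    return best, best_cell
```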

 

For example, we can set

d = 2 means the cost of an insertion or a deletion (d is positive, because the formulas subtract it)

a = 1 means when x[i] is equal to y[j], then s(x[i], y[j]) = 1

b = -1 means when x[i] isn't equal to y[j], then s(x[i], y[j]) = -1

So, according to the formula, you can fill in the matrix cell by cell. Finally, you can use a backtrace, starting from the maximum cell and stopping at the first 0, to get the path.
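For the same strings x = AGA and y = AG, a rough Python sketch of the local backtrace could look like the following. The function name `local_align` and the tie-breaking order are assumptions of mine, and the defaults again use a positive gap penalty d = 2 with a = 1 and b = -1.

```python
def local_align(x, y, d=2, a=1, b=-1):
    """Return (score, aligned_x, aligned_y) for the best local alignment.

    Hypothetical helper; d is a positive gap penalty.
    """
    def s(p, q):
        return a if p == q else b

    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(0,
                          D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
            if D[i][j] > best:
                best, bi, bj = D[i][j], i, j

    # Backtrace from the maximum cell and stop at the first 0.
    # The border cells are 0, so i and j stay positive inside the loop.
    ax, ay = [], []
    i, j = bi, bj
    while D[i][j] > 0:
        if D[i][j] == D[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif D[i][j] == D[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")   # gap in y
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])   # gap in x
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("AGA", "AG"))  # (2, 'AG', 'AG')
```

Here the local score is 2: the trailing A of x is simply left out instead of being charged as a gap, which is the difference from the global version.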

Conclusion

In short, there are a lot of applications based on edit distance. In computational biology we want the maximum similarity, and when we want to correct spelling we want the minimum distance. But, in my opinion, the principle is the same: each one is looking for the minimum edit distance under its own scoring. Also, even though we introduced two algorithms above, I think that is not sufficient, because there are more algorithms. So what we should care about most is the core of the approach, which is to apply dynamic programming according to the conditions of the problem.

posted on 2013-04-02 11:14  MrMission