简单DNA序列组装(非循环子图)
生物信息学原理作业第四弹:DNA序列组装(非循环子图)
原理:生物信息学(孙啸)
大致思想:
1. 这个算法理解细节理解比较困难,建议看孙啸的生物信息学相关章节。
2. 算法要求所有序列覆盖整个目标DNA,并保证相邻片段有足够的覆盖连接(引自孙啸 生物信息学)。
3. 最后推导出符合条件的序列构成的有向图没有回路,并有哈密顿路径。
4. 利用拓扑排序,得到顶点的有序排列。
5. 组装。
贴上Python代码,发现问题我会及时更正。
转载请保留出处!
1 # -*- coding: utf-8 -*-
2 """
3 Created on Sat Dec 2 16:09:14 2017
4 @author: zxzhu
5 python3.6
6 """
7 from functools import reduce
8
9 def get_weight(s1,s2): #通过两条序列的overlap计算出权值
10 l = min(len(s1),len(s2))
11 while l>0:
12 if s2[:l] == s1[-l:]:
13 return l
14 else:
15 l-=1
16 return 0
17
18 def print_result(s1,s2): #将两条序列去除首尾overlap后合并
19 weight = get_weight(s1,s2)
20 s = s1 + s2[weight:]
21 #print(s)
22 return s
23
24 def dir_graph(l,t=3): #得到满足条件的有向图(权值大于等于t)
25 graph = {}
26 for i in l:
27 VW = []
28 for j in l:
29 if i!=j:
30 weight = get_weight(i,j)
31 if weight >= t:
32 VW.append(j)
33 graph[i] = VW
34 #print(graph)
35 for i in graph.keys(): #不能有孤立顶点
36 if not graph[i]:
37 count = get_in_V(graph,i)
38 if count ==0:
39 graph.clear()
40 print('The sequence:\n"{0}"\n can\'t align with others!'.format(i))
41 break
42 return graph
43
44 def get_in_V(graph,v): #得到某顶点入度
45 count = 0
46 all_in = reduce(lambda x,y:x+y,graph.values())
47 for i in all_in:
48 if i == v:
49 count+=1
50 return count
51
52 def aligner(graph,topo=[]): #得出顶点顺序
53 while graph:
54 V = graph.keys()
55 for i in V:
56 flag = 1
57 in_num = get_in_V(graph,i)
58 if in_num ==0:
59 topo.append(i)
60 graph.pop(i)
61 flag = 0
62 break
63 if flag: #存在环
64 #print('The t score is too small!')
65 return None
66 else:
67 aligner(graph,topo)
68 return topo
69
70
71 x = 'CCTTTTGG'
72 y = 'TTGGCAATCACT'
73 w = 'AGTATTGGCAATC'
74 u = 'ATGCAAACCT'
75 z = 'AATCGATG'
76 v = 'TCACTCCTTTT'
77 graph = dir_graph([x,y,z,w,u],t=3)
78 topo = aligner(graph)
79 if topo:
80 result = reduce(print_result,topo)
81 else:
82 result = topo
83 print(result)
Talk is cheap,show me your code!