Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
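Storing the substrings directly in a HashMap exceeds the memory limit, so each substring has to be packed into an Integer using bit manipulation.
Look at the ASCII codes: A is 0x41, C is 0x43, G is 0x47, T is 0x54. In octal these are:

A is 0101, C is 0103, G is 0107, T is 0124. The last octal digit (the low 3 bits) is different for each letter: 1, 3, 7, and 4. An Integer is 32 bits long, which is just enough to hold 10 × 3 = 30 bits.

A few bit operations are then enough to map each 10-letter substring to an integer.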
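
To make the mapping concrete, here is a minimal sketch of the encoding on its own. The class name DnaEncodingDemo and the helpers code and encode are made up for illustration and are not part of the solution; the sketch assumes the input contains only the letters A, C, G, and T.

```java
class DnaEncodingDemo {
    // Low three bits of the ASCII code: A -> 1, C -> 3, G -> 7, T -> 4.
    static int code(char c) {
        return c & 0x7;
    }

    // Packs the last 10 letters of seq into one int, 3 bits per letter.
    // The mask 0x3FFFFFFF keeps only the low 30 bits.
    static int encode(String seq) {
        int number = 0;
        for (int i = 0; i < seq.length(); i++) {
            number = ((number << 3) | code(seq.charAt(i))) & 0x3FFFFFFF;
        }
        return number;
    }

    public static void main(String[] args) {
        for (char c : "ACGT".toCharArray()) {
            System.out.println(c + " -> " + code(c));
        }
        // Each letter occupies exactly one octal digit.
        System.out.println(Integer.toOctalString(encode("AAAAACCCCC"))); // prints 1111133333
    }
}
```

Because every code is non-zero, a window with fewer than 10 letters can never produce the same integer as a full 10-letter window.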

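import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> results = new ArrayList<String>();
        // Key: 30-bit encoding of a 10-letter window; value: times seen so far.
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        int number = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the low 3 bits of the new letter; the mask keeps the last 10 letters (30 bits).
            number = ((number << 3) | (s.charAt(i) & 0x7)) & 0x3FFFFFFF;
            if (i < 9) continue; // the window is not yet 10 letters long
            int occurs = map.containsKey(number) ? map.get(number) + 1 : 1;
            if (occurs == 2) results.add(s.substring(i - 9, i + 1)); // add only on the second occurrence, to avoid duplicates
            map.put(number, occurs);
        }
        return results;
    }
}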
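
As a quick check against the example from the problem statement, here is a small self-contained variation. It is only a sketch, not the solution above: the class name RepeatedDnaSets, the method findRepeated, and the two sets seen and reported are my own names, and it swaps the counting map for two sets, since only "seen before" and "already reported" need to be tracked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepeatedDnaSets {
    static List<String> findRepeated(String s) {
        List<String> results = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();     // encodings of windows seen at least once
        Set<Integer> reported = new HashSet<>(); // encodings already added to results
        int number = 0;
        for (int i = 0; i < s.length(); i++) {
            // Same rolling 3-bit encoding as in the solution.
            number = ((number << 3) | (s.charAt(i) & 0x7)) & 0x3FFFFFFF;
            if (i < 9) continue; // window not yet 10 letters long
            // Set.add returns false when the element was already present.
            if (!seen.add(number) && reported.add(number)) {
                results.add(s.substring(i - 9, i + 1));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [AAAAACCCCC, CCCCCAAAAA]
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

A string shorter than 10 letters simply yields an empty list, because the loop never gets past the i < 9 check.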


posted @ 2015-02-07 05:39  江南第一少