187. Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
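As a baseline for the example above, here is a minimal sketch that compares the 10-letter windows as plain strings. The class name `RepeatedDnaBaseline` and the method name `findRepeated` are made up for illustration and are not part of the original problem. It stores every window as a separate `String`, so it uses more memory than a bit-packed approach, but it is easy to check by hand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepeatedDnaBaseline {
  // Hypothetical helper: returns each 10-letter substring that occurs
  // more than once, in order of its second occurrence.
  static List<String> findRepeated(String s) {
    final int len = 10;
    Set<String> seen = new HashSet<>();      // windows seen at least once
    Set<String> reported = new HashSet<>();  // windows already added to result
    List<String> result = new ArrayList<>();
    for (int i = 0; i + len <= s.length(); i++) {
      String window = s.substring(i, i + len);
      // Set.add() returns false when the element is already present.
      if (!seen.add(window) && reported.add(window)) {
        result.add(window);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [AAAAACCCCC, CCCCCAAAAA]
    System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
  }
}
```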


 
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
  public List<String> findRepeatedDnaSequences(String s) {
    int N = 10;
    List<String> ret = new ArrayList<>();
    if (s.length() < N + 1)
      return ret;

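    // Two hash sets perform better than a map that keeps a count per
    // sequence: the first records hashes seen at least once, the second
    // records hashes already reported, so each repeat is added only once.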
    Set<Integer> firstAppearance = new HashSet<>();
    Set<Integer> secondAppearance = new HashSet<>();

    // 2-bit code per nucleotide: A = 0, C = 1, G = 2, T = 3.
    int[] map = new int[26];  // map['A' - 'A'] is already 0 by default
    map['C' - 'A'] = 1;
    map['G' - 'A'] = 2;
    map['T' - 'A'] = 3;

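    int v = 0;     // rolling 2-bits-per-letter encoding of the current window
    int mask = 3;  // grows to 20 one-bits (2 bits x 10 letters) in the loop below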
    for (int i = 0; i < N - 1; ++i) {
      v <<= 2;
      v |= map[s.charAt(i) - 'A']; //same as v +=  ...

      mask <<= 2;
      mask |= 3;
    }

    for (int i = N - 1; i < s.length(); ++i) {
      v <<= 2;
      v |= map[s.charAt(i) - 'A']; //same as v +=  ...
      v &= mask;  //only keep the last 20 bits.

      if (!firstAppearance.add(v) && secondAppearance.add(v)) {
        ret.add(s.substring(i - N + 1, i + 1));
      }
    }

    return ret;
  }
}
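
The solution above packs each 10-letter window into 20 bits and updates that value as the window slides. The sketch below isolates the idea so it can be checked on its own. The names `DnaEncodingDemo`, `encode`, and `roll` are hypothetical, and the code assumes the input contains only A, C, G, and T.

```java
class DnaEncodingDemo {
  // Packs a sequence into an int, 2 bits per letter (A=0, C=1, G=2, T=3).
  // Assumes only A/C/G/T appear and the sequence has at most 15 letters.
  static int encode(String seq) {
    int v = 0;
    for (int i = 0; i < seq.length(); i++) {
      v = (v << 2) | "ACGT".indexOf(seq.charAt(i));
    }
    return v;
  }

  // Slides the window one letter to the right: shift in the new 2-bit code,
  // then mask off everything above the lowest 2 * windowLen bits.
  static int roll(int v, char next, int windowLen) {
    int mask = (1 << (2 * windowLen)) - 1;
    return ((v << 2) | "ACGT".indexOf(next)) & mask;
  }

  public static void main(String[] args) {
    String s = "AAAAACCCCCA";
    int first = encode(s.substring(0, 10));      // window "AAAAACCCCC"
    int second = roll(first, s.charAt(10), 10);  // window "AAAACCCCCA"
    // prints true: rolling gives the same value as encoding from scratch
    System.out.println(second == encode(s.substring(1, 11)));
  }
}
```

Because each distinct 10-letter window maps to a distinct 20-bit value, comparing the integers is equivalent to comparing the substrings, with no hash collisions to handle.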

 

posted @ 2016-07-17 12:23 by 新一代的天皇巨星