LeetCode 187: Repeated DNA Sequences (hash table, sliding window, bit manipulation, trie)

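Link: https://leetcode-cn.com/problems/repeated-dna-sequences/

Problem

All DNA is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes very useful to identify repeated sequences within it.

Write a function that finds all target substrings: substrings of length 10 that occur more than once in the DNA string s.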

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

0 <= s.length <= 10^5
s[i] is 'A', 'C', 'G', or 'T'

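Approach

Method 1
Looking for duplicates suggests walking over every length-10 substring and storing it. My first attempt stored the sequences in a trie, but the result was mediocre: it was slow and used a lot of memory.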

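class Trie{
public:
    vector<Trie*>next;   // children indexed by nucleotide code: A=0, C=1, G=2, T=3
    int isfind;          // becomes 1 once the sequence ending at this node has been reported
    Trie(){
        next=vector<Trie*>(4);
        isfind =0;
    }
};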
class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        int n=s.size();
        if(n<11)
            return {};
        vector<string>ans;
        Trie *zd=new Trie;
        unordered_map<char,int>mp;
        mp['A']=0;
        mp['C']=1;
        mp['G']=2;
        mp['T']=3;
        for(int i=0;i<=n-10;++i)
        {
            int ischange=0; // becomes 1 if inserting this window creates any new node
            Trie *ptr=zd;
            for(int j=i;j<i+10;++j)
            {
                int thischar=mp[s[j]];
                if(ptr->next[thischar]==nullptr)
                {
                    ptr->next[thischar]=new Trie;
                    ischange=1;
                }
                ptr=ptr->next[thischar];
            }
            // No new node means this exact window was inserted before; report it only once.
            if(ischange==0&&ptr->isfind==0)
            {
                (ptr->isfind)++;
                ans.push_back(string(s.begin()+i,s.begin()+i+10));
            }
        }
        return ans;
    }
};

Method 2
A hash table can also store the substrings directly, counting how many times each one appears.

class Solution {
    const int L = 10;
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> ans;
        unordered_map<string, int> cnt;
        int n = s.length();
        for (int i = 0; i <= n - L; ++i) {
            string sub = s.substr(i, L);
            if (++cnt[sub] == 2) {
                ans.push_back(sub);
            }
        }
        return ans;
    }
};

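Method 3
The official solution is very elegant.
There are only 4 distinct characters, so each one can be encoded in two bits: A = 00, C = 01, G = 10, T = 11.
A sequence has length 10, so a whole window needs only 20 bits and fits in the low 20 bits of a 32-bit int.
Appending the next character on the right: x = (x << 2) | bin[s[i+L-1]]
Dropping the oldest character on the left: x = x & ((1 << (L*2)) - 1)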
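
To make the two operations concrete, here is a small sketch of the encoding on its own. The helper names code, encode, and slide are made up for this illustration and do not appear in the official solution; the input is assumed to contain only A, C, G, and T.

```cpp
#include <string>

constexpr int L = 10;
constexpr int MASK = (1 << (2 * L)) - 1;  // low 20 bits set

// Hypothetical helper: 2-bit code of one nucleotide.
inline int code(char c) {
    switch (c) {
        case 'A': return 0;  // 00
        case 'C': return 1;  // 01
        case 'G': return 2;  // 10
        default:  return 3;  // 'T' -> 11 (assumes valid input)
    }
}

// Encode exactly L characters of s starting at position pos.
inline int encode(const std::string& s, int pos) {
    int x = 0;
    for (int i = 0; i < L; ++i) x = (x << 2) | code(s[pos + i]);
    return x;
}

// Slide the window by one: append `next` on the right,
// then mask away the oldest character that moved past bit 19.
inline int slide(int x, char next) {
    return ((x << 2) | code(next)) & MASK;
}
```

For example, "AAAAACCCCC" encodes to binary 0101010101 (decimal 341), and sliding that window by one character gives the same value as encoding the next window from scratch, which is why each step costs O(1) instead of O(L).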

class Solution {
    const int L=10;
    unordered_map<char,int>bin={{'A',0},{'C',1},{'G',2},{'T',3}};
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string>ans;
        int n =s.size();
        if(n<=L)
            return ans;
        int x=0;
        for(int i=0;i<L-1;++i)
            x=(x<<2) | bin[s[i]];// preload the first L-1 characters; only the low 20 bits of the 32-bit int are used
        unordered_map<int ,int>cnt;
        for(int i=0;i<=n-L;++i)
        {
            x=((x<<2) | bin[s[i+L-1]]) &((1<<(L*2))-1);
            if(++cnt[x]==2){
                ans.push_back(s.substr(i,L));
            }
        }
        return ans;
    }
};
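
Because the key in Method 3 is always below 2^20, one possible variation is to replace the unordered_map with a flat table indexed by the key. The sketch below is my own assumption of how that could look, not part of the official solution; the function name findRepeatedDnaFlat is invented, and I have not measured whether it is faster in practice.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: same rolling 20-bit key, counts kept in a flat table.
std::vector<std::string> findRepeatedDnaFlat(const std::string& s) {
    constexpr int L = 10;
    constexpr int MASK = (1 << (2 * L)) - 1;
    std::vector<std::string> ans;
    const int n = static_cast<int>(s.size());
    if (n <= L) return ans;  // fewer than two windows, nothing can repeat

    // 2-bit code per nucleotide (assumes the input contains only A/C/G/T).
    auto code = [](char c) -> int {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;
        }
    };

    // 0 = unseen, 1 = seen once, 2 = already reported (saturates at 2).
    std::vector<std::uint8_t> seen(1 << (2 * L), 0);

    int x = 0;
    for (int i = 0; i < L - 1; ++i) x = (x << 2) | code(s[i]);
    for (int i = 0; i + L <= n; ++i) {
        x = ((x << 2) | code(s[i + L - 1])) & MASK;
        if (seen[x] < 2 && ++seen[x] == 2) ans.push_back(s.substr(i, L));
    }
    return ans;
}
```

The trade-off is a fixed table of 2^20 bytes (about 1 MiB) regardless of input length, in exchange for avoiding hashing and node allocation. The counter saturates at 2 so that a sequence appearing hundreds of times cannot overflow the byte and be reported twice.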

posted @ 2021-10-08 13:47 kitamu
