[HiFi] Assembling CCS Data

It's the last month of 2021. I set countless goals for the year and still got nothing done.

===============cut========================

It's already 2022. There's no time to be sentimental; do I really want this year to go the way 2021 did?

==============cut========================

CCS data assembly:

1. Install the assembler

conda install -c bioconda hifiasm

2. Convert the data format

samtools fastq input.bam > output.fq

samtools fasta input.bam > output.fa
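
When there are several BAM files, the conversion above can be wrapped in a loop. The following is a minimal sketch, assuming samtools is on PATH and that one FASTQ per BAM is wanted; the function name bam_to_fastq_all and the directory layout are hypothetical, not something from the original run.

```shell
# Convert every BAM in a directory to a FASTQ with the same base name.
# Assumption: samtools is installed; the function name is illustrative only.
bam_to_fastq_all() {
    local dir="${1:-.}"
    local bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue              # skip if the glob matched nothing
        # samtools fastq writes to stdout unless told otherwise, so redirect it
        samtools fastq "$bam" > "${bam%.bam}.fq"
    done
}
```

Calling bam_to_fastq_all /path/to/bams would leave one .fq next to each .bam.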

3. Check the usage

$ hifiasm
Usage: hifiasm [options] <in_1.fq> <in_2.fq> <...>
Options:
Input/Output:
-o STR prefix of output files [hifiasm.asm]
-t INT number of threads [1]
-h show help information
--version show version number
Overlap/Error correction:
-k INT k-mer length (must be <64) [51]
-w INT minimizer window size [51]
-f INT number of bits for bloom filter; 0 to disable [37]
-D FLOAT drop k-mers occurring >FLOAT*coverage times [5.0]
-N INT consider up to max(-D*coverage,-N) overlaps for each oriented read [100]
-r INT round of correction [3]
-z INT length of adapters that should be removed [0]
--max-kocc INT
employ k-mers occurring <INT times to rescue repetitive overlaps [2000]
--hg-size INT(k, m or g)
estimated haploid genome size used for inferring read coverage [auto]
Assembly:
-a INT round of assembly cleaning [4]
-m INT pop bubbles of <INT in size in contig graphs [10000000]
-p INT pop bubbles of <INT in size in unitig graphs [0]
-n INT remove tip unitigs composed of <=INT reads [3]
-x FLOAT max overlap drop ratio [0.8]
-y FLOAT min overlap drop ratio [0.2]
-i ignore saved read correction and overlaps
-u disable post-join step for contigs which may improve N50
--hom-cov INT
homozygous read coverage [auto]
--lowQ INT
output contig regions with >=INT% inconsistency in BED format; 0 to disable [70]
--b-cov INT
break contigs at positions with <INT-fold coverage; work with '--m-rate'; 0 to disable [0]
--h-cov INT
break contigs at positions with >INT-fold coverage; work with '--m-rate'; -1 to disable [-1]
--m-rate FLOAT
break contigs at positions with <=FLOAT*coverage exact overlaps;
only work with '--b-cov' or '--h-cov'[0.75]
--primary output a primary assembly and an alternate assembly
Trio-partition:
-1 FILE hap1/paternal k-mer dump generated by "yak count" []
-2 FILE hap2/maternal k-mer dump generated by "yak count" []
-3 FILE list of hap1/paternal read names []
-4 FILE list of hap2/maternal read names []
-c INT lower bound of the binned k-mer's frequency [2]
-d INT upper bound of the binned k-mer's frequency [5]
--t-occ INT
forcedly remove unitigs with >INT unexpected haplotype-specific reads;
ignore graph topology; [60]
Purge-dups:
-l INT purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
-s FLOAT similarity threshold for duplicate haplotigs [0.75 for -l1/-l2, 0.55 for -l3]
-O INT min number of overlapped reads for duplicate haplotigs [1]
--purge-max INT
coverage upper bound of Purge-dups [auto]
--n-hap INT
number of haplotypes [2]
Hi-C-partition:
--h1 FILEs file names of Hi-C R1 [r1_1.fq,r1_2.fq,...]
--h2 FILEs file names of Hi-C R2 [r2_1.fq,r2_2.fq,...]
--seed INT RNG seed [11]
--n-weight INT
rounds of reweighting Hi-C links [3]
--n-perturb INT
rounds of perturbation [10000]
--f-perturb FLOAT
fraction to flip for perturbation [0.1]
--l-msjoin INT
detect misjoined unitigs of >=INT in size; 0 to disable [500000]
Example: ./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
See `https://hifiasm.readthedocs.io/en/latest/' or `man ./hifiasm.1' for complete documentation.

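4. It takes FASTQ directly, which is nice.

When there is a lot of data, split it into several files; per the usage line above, hifiasm accepts multiple input files.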
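
One way to do the splitting is by read count. This is a rough sketch, assuming plain 4-line FASTQ records (which is what samtools fastq writes); the function name split_fastq and the chunk naming scheme are my own choices for illustration.

```shell
# Split a FASTQ into chunks of at most N reads each (4 lines per read).
# Output files are named PREFIX.001.fq, PREFIX.002.fq, ...
# Assumption: records are exactly 4 lines long (no wrapped sequences).
split_fastq() {
    local fq="$1" reads_per_chunk="$2" prefix="$3"
    awk -v n="$reads_per_chunk" -v p="$prefix" '
        NR % (4 * n) == 1 {                    # first line of a new chunk
            if (out) close(out)
            out = sprintf("%s.%03d.fq", p, ++i)
        }
        { print > out }
    ' "$fq"
}
```

The chunks can then be passed together in one run, for example hifiasm -o out.asm -t 64 chunk.*.fq, where out.asm and the chunk prefix are placeholder names.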

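5. Run it first, look into the principles later.

$ hifiasm -o fu2021.25k.fq.asm -t 64 fu2021.25k.fq (This input keeps only reads of 25 kb or longer, so the assembly itself is surely useless, but it shows the program runs without problems.)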
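
For reference, a length-filtered file like fu2021.25k.fq could be produced with awk. This is only a sketch of one possible way; the post does not record how the file was actually made, and the function name filter_fastq_min_len and the ">= 25000" cutoff are assumptions.

```shell
# Keep only FASTQ records whose sequence is at least MIN_LEN bases long.
# Reads FASTQ on stdin and writes the surviving records to stdout.
# Assumption: 4-line FASTQ records.
filter_fastq_min_len() {
    local min_len="$1"
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }            # @name line
        NR % 4 == 2 { seq = $0 }               # sequence line
        NR % 4 == 3 { plus = $0 }              # separator line
        NR % 4 == 0 && length(seq) >= min {    # quality line completes the record
            print header; print seq; print plus; print $0
        }'
}
```

Usage would look like filter_fastq_min_len 25000 < all_reads.fq > fu2021.25k.fq, where all_reads.fq is a placeholder for the full read set.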

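6. Check for contamination.

I don't know how other people do this. My plan is to download bacterial genomes first and then align the assembly against them with MUMmer.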
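
A rough sketch of that plan follows. hifiasm writes its contigs as GFA, so they need to be converted to FASTA before MUMmer can read them. The function names, the bacterial FASTA file, and the output prefix are all hypothetical; nucmer and show-coords are assumed to be installed from MUMmer.

```shell
# Extract contig sequences from a GFA file.
# In GFA, "S" (segment) lines carry the name in column 2 and the sequence in column 3.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Align contigs (query) against a bacterial genome collection (reference)
# and write a coordinate table sorted by reference, with coverage and lengths.
# Assumption: bacteria_fa is a multi-FASTA downloaded separately.
screen_contamination() {
    local bacteria_fa="$1" contigs_fa="$2" prefix="$3"
    nucmer -p "$prefix" "$bacteria_fa" "$contigs_fa"
    show-coords -rcl "$prefix.delta" > "$prefix.coords"
}
```

With these defined, the steps would be gfa_to_fasta asm.p_ctg.gfa > contigs.fa followed by screen_contamination bacteria.fa contigs.fa contam, where all three file names are placeholders. Contigs that align to a bacterial genome over most of their length are the candidates to inspect.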

 

posted on 2021-12-03 11:40  Yuan-SW-F(abysw)