emmax程序GWAS分析
EMMAX的主要特点:
-
线性混合模型:EMMAX使用线性混合模型(Linear Mixed Model, LMM),其中包括固定效应(如待检验的SNP)和随机效应(通常是亲缘关系矩阵或种群结构)来校正样本间的相关性。
-
高效计算:它通过使用近似方法来计算随机效应,使得分析大规模数据集(如人类或动植物基因组数据)更为高效。
-
种群结构和亲缘关系校正:EMMAX能够有效地处理种群结构和亲缘关系,这在许多群体遗传学研究中非常关键。
-
减少假阳性:在考虑种群结构和亲缘关系的情况下,EMMAX有助于减少假阳性的发现,提高GWAS的可靠性。
格式转换和过滤
{bash, eval=FALSE}
# grep -v "^#" snp.vcf|wc -l
# 1,057,193,6ll(10M SNPs)
# vcf格式转成tped格式(emmax软件要求,plink软件)
singularity exec ../software/Reseq_genek.sif plink --vcf ./vcf/snp.vcf --maf 0.05 --geno 0.1 --recode 12 transpose --output-missing-genotype 0 --out ./data/snp_filter --set-missing-var-ids @:# --allow-extra-chr --keep-allele-order
# 结果会生成snp_filter.tfam,snp_filter.nosex,snp_filter.tped三个文件
# cat snp_filter.tped|wc -l
# 过滤之后统计,剩余2,603,811 SNPs
#替换snp_filter.tped文件染色体名称(1-21)
#计算亲缘关系矩阵
singularity exec ../software/Reseq_genek.sif emmax-kin-intel64 -v -d 10 -o ./pop.kinship snp_filter
数据过滤
--maf 0.05: 过滤掉 Minor Allele Frequency(次等位基因频率)小于 0.05 的变异。
--geno 0.1: 过滤掉在超过 10% 的样本中缺失的变异。
数据转换和输出格式
--recode 12: 以 "12" 格式重新编码基因型数据。在这种格式中,主等位基因(major allele)编码为 "1",次等位基因(minor allele)编码为 "2"。
transpose: 转置输出的矩阵,即行变为列,列变为行。
--output-missing-genotype 0: 将缺失的基因型输出为 "0"。
变异 ID 和染色体
--set-missing-var-ids @:#: 为没有 ID 的变异设置 ID。这里 "@" 代表染色体名,"#" 代表基因座(位置)。
--allow-extra-chr: 允许非标准的染色体名。
--keep-allele-order: 保持输入文件中等位基因的顺序。
GWAS分析
#!/bin/bash
# 定义一些常用的路径和文件名
data_dir="../data"
script_dir="../../Part6.GWAS/script"
software_dir="../../software"
sif_file="../../software/Reseq_genek.sif"
trait_file="$1"
for i in $(cat $trait_file); do
echo "Processing trait: $i"
# 表型数据排序
perl $script_dir/sort_pheno.pl $data_dir/snp_filter.tfam $i > ${i}.sort.txt
# GWAS 分析
singularity exec $sif_file emmax-intel64 -v -d 10 -t $data_dir/snp_filter -p ${i}.sort.txt -k $data_dir/pop.kinship -o ${i}.gwas.out 1> ${i}.gwas.log 2> ${i}.gwas.err
# 提取 p-value
awk '{print $1, $2, $3, $4}' $data_dir/snp_filter.tped |paste - ${i}.gwas.out.ps | awk 'BEGIN{print "SNP\tCHR\tBP\tP"}{if($2==$5){print $2, $1, $4, $NF}}' > ${i}.gwas.snp
awk '{print $1, $2, $3, $4}' $data_dir/snp_filter.tped |paste - ${i}.gwas.out.ps | awk 'BEGIN{print "SNP\tCHR\tBP\tP\tbeta\tSE"}{if($2==$5){print $2, $1, $4, $NF,$6,$7}}' > ${i}.gwas.snp1
# 重命名染色体
awk 'BEGIN {OFS="\t"} {if ($2 == "1") $2="chr1A"; else if ($2 == "2") $2="chr1B"; else if ($2 == "3") $2="chr1D"; else if ($2 == "4") $2="chr2A"; else if ($2 == "5") $2="chr2B"; else if ($2 == "6") $2="chr2D"; else if ($2 == "7") $2="chr3A"; else if ($2 == "8") $2="chr3B"; else if ($2 == "9") $2="chr3D"; else if ($2 == "10") $2="chr4A"; else if ($2 == "11") $2="chr4B"; else if ($2 == "12") $2="chr4D"; else if ($2 == "13") $2="chr5A"; else if ($2 == "14") $2="chr5B"; else if ($2 == "15") $2="chr5D"; else if ($2 == "16") $2="chr6A"; else if ($2 == "17") $2="chr6B"; else if ($2 == "18") $2="chr6D"; else if ($2 == "19") $2="chr7A"; else if ($2 == "20") $2="chr7B"; else if ($2 == "21") $2="chr7D";} 1' ${i}.gwas.snp > ${i}.gwas.snp2
# 重命名染色体
awk 'BEGIN {OFS="\t"} {if ($2 == "1") $2="chr1A"; else if ($2 == "2") $2="chr1B"; else if ($2 == "3") $2="chr1D"; else if ($2 == "4") $2="chr2A"; else if ($2 == "5") $2="chr2B"; else if ($2 == "6") $2="chr2D"; else if ($2 == "7") $2="chr3A"; else if ($2 == "8") $2="chr3B"; else if ($2 == "9") $2="chr3D"; else if ($2 == "10") $2="chr4A"; else if ($2 == "11") $2="chr4B"; else if ($2 == "12") $2="chr4D"; else if ($2 == "13") $2="chr5A"; else if ($2 == "14") $2="chr5B"; else if ($2 == "15") $2="chr5D"; else if ($2 == "16") $2="chr6A"; else if ($2 == "17") $2="chr6B"; else if ($2 == "18") $2="chr6D"; else if ($2 == "19") $2="chr7A"; else if ($2 == "20") $2="chr7B"; else if ($2 == "21") $2="chr7D";} 1' ${i}.gwas.snp1 > ${i}.gwas.snp3
rm ${i}.gwas.out.ps
rm ${i}.gwas.snp
rm ${i}.gwas.snp1
# 绘制曼哈顿和QQ图
Rscript ../manhattan_cmplot2.R ${i}.gwas.snp2 ${I}
# 合并
python merge_fig.py ${i}_manhattan.png ${i}_qq.png ${i}_manhattan_qq.png
done
浙公网安备 33010602011771号