emmax程序GWAS分析

EMMAX的主要特点:

  1. 线性混合模型:EMMAX使用线性混合模型(Linear Mixed Model, LMM),其中包括固定效应(如待检验的SNP)和随机效应(通常是亲缘关系矩阵或种群结构)来校正样本间的相关性。

  2. 高效计算:它通过使用近似方法来计算随机效应,使得分析大规模数据集(如人类或动植物基因组数据)更为高效。

  3. 种群结构和亲缘关系校正:EMMAX能够有效地处理种群结构和亲缘关系,这在许多群体遗传学研究中非常关键。

  4. 减少假阳性:在考虑种群结构和亲缘关系的情况下,EMMAX有助于减少假阳性的发现,提高GWAS的可靠性。

格式转换和过滤

{bash, eval=FALSE}
# grep -v "^#" snp.vcf|wc -l
# 1,057,193,6ll(10M SNPs) 

# vcf格式转成tped格式(emmax软件要求,plink软件)
singularity  exec  ../software/Reseq_genek.sif plink --vcf ./vcf/snp.vcf --maf 0.05 --geno 0.1 --recode 12 transpose  --output-missing-genotype 0  --out ./data/snp_filter   --set-missing-var-ids @:#  --allow-extra-chr  --keep-allele-order

# 结果会生成snp_filter.tfam,snp_filter.nosex,snp_filter.tped三个文件

# cat snp_filter.tped|wc -l
# 过滤之后统计,剩余2,603,811 SNPs

#替换snp_filter.tped文件染色体名称(1-21)

#计算亲缘关系矩阵
singularity  exec  ../software/Reseq_genek.sif emmax-kin-intel64 -v -d 10  -o ./pop.kinship  snp_filter

数据过滤

--maf 0.05: 过滤掉 Minor Allele Frequency(次等位基因频率)小于 0.05 的变异。

--geno 0.1: 过滤掉在超过 10% 的样本中缺失的变异。

数据转换和输出格式

--recode 12: 以 "12" 格式重新编码基因型数据。在这种格式中,主等位基因(major allele)编码为 "1",次等位基因(minor allele)编码为 "2"。

transpose: 转置输出的矩阵,即行变为列,列变为行。

--output-missing-genotype 0: 将缺失的基因型输出为 "0"。

变异 ID 和染色体

--set-missing-var-ids @:#: 为没有 ID 的变异设置 ID。这里 "@" 代表染色体名,"#" 代表基因座(位置)。

--allow-extra-chr: 允许非标准的染色体名。

--keep-allele-order: 保持输入文件中等位基因的顺序。

GWAS分析

#!/bin/bash

# 定义一些常用的路径和文件名
data_dir="../data"
script_dir="../../Part6.GWAS/script"
software_dir="../../software"
sif_file="../../software/Reseq_genek.sif"

trait_file="$1"
for i in $(cat $trait_file); do
    echo "Processing trait: $i"

    # 表型数据排序
    perl $script_dir/sort_pheno.pl $data_dir/snp_filter.tfam $i > ${i}.sort.txt

    # GWAS 分析
    singularity exec $sif_file emmax-intel64 -v -d 10 -t $data_dir/snp_filter -p ${i}.sort.txt -k $data_dir/pop.kinship -o ${i}.gwas.out 1> ${i}.gwas.log 2> ${i}.gwas.err

    # 提取 p-value
    awk '{print $1, $2, $3, $4}' $data_dir/snp_filter.tped |paste - ${i}.gwas.out.ps | awk 'BEGIN{print "SNP\tCHR\tBP\tP"}{if($2==$5){print $2, $1, $4, $NF}}' > ${i}.gwas.snp

    awk '{print $1, $2, $3, $4}' $data_dir/snp_filter.tped |paste - ${i}.gwas.out.ps | awk 'BEGIN{print "SNP\tCHR\tBP\tP\tbeta\tSE"}{if($2==$5){print $2, $1, $4, $NF,$6,$7}}' > ${i}.gwas.snp1

    # 重命名染色体
    awk 'BEGIN {OFS="\t"} {if ($2 == "1") $2="chr1A"; else if ($2 == "2") $2="chr1B"; else if ($2 == "3") $2="chr1D"; else if ($2 == "4") $2="chr2A"; else if ($2 == "5") $2="chr2B"; else if ($2 == "6") $2="chr2D"; else if ($2 == "7") $2="chr3A"; else if ($2 == "8") $2="chr3B"; else if ($2 == "9") $2="chr3D"; else if ($2 == "10") $2="chr4A"; else if ($2 == "11") $2="chr4B"; else if ($2 == "12") $2="chr4D"; else if ($2 == "13") $2="chr5A"; else if ($2 == "14") $2="chr5B"; else if ($2 == "15") $2="chr5D"; else if ($2 == "16") $2="chr6A"; else if ($2 == "17") $2="chr6B"; else if ($2 == "18") $2="chr6D"; else if ($2 == "19") $2="chr7A"; else if ($2 == "20") $2="chr7B"; else if ($2 == "21") $2="chr7D";} 1' ${i}.gwas.snp > ${i}.gwas.snp2

  # 重命名染色体 
    awk 'BEGIN {OFS="\t"} {if ($2 == "1") $2="chr1A"; else if ($2 == "2") $2="chr1B"; else if ($2 == "3") $2="chr1D"; else if ($2 == "4") $2="chr2A"; else if ($2 == "5") $2="chr2B"; else if ($2 == "6") $2="chr2D"; else if ($2 == "7") $2="chr3A"; else if ($2 == "8") $2="chr3B"; else if ($2 == "9") $2="chr3D"; else if ($2 == "10") $2="chr4A"; else if ($2 == "11") $2="chr4B"; else if ($2 == "12") $2="chr4D"; else if ($2 == "13") $2="chr5A"; else if ($2 == "14") $2="chr5B"; else if ($2 == "15") $2="chr5D"; else if ($2 == "16") $2="chr6A"; else if ($2 == "17") $2="chr6B"; else if ($2 == "18") $2="chr6D"; else if ($2 == "19") $2="chr7A"; else if ($2 == "20") $2="chr7B"; else if ($2 == "21") $2="chr7D";} 1' ${i}.gwas.snp1 > ${i}.gwas.snp3
    
    rm ${i}.gwas.out.ps
    rm ${i}.gwas.snp
    rm ${i}.gwas.snp1

  # 绘制曼哈顿和QQ图
   Rscript ../manhattan_cmplot2.R ${i}.gwas.snp2 ${I}
  
   # 合并
   python merge_fig.py ${i}_manhattan.png ${i}_qq.png ${i}_manhattan_qq.png

done
posted on 2024-01-19 23:16  燕子南飞0415  阅读(833)  评论(0)    收藏  举报