跟着东京大学镰谷研究室学习GWAS分析及可视化 - 实践

跟着东京大学镰谷研究室学习GWAS分析及可视化

REF:https://cloufield.github.io/GWASTutorial/
这里将给大家呈现GWAS分析的一般流程:从数据下载,前期质控,主成分分析,plink2表型关联分析到GWAS的可视化

1. Pre-GWAS(前期准备)

1.1 克隆相关文件

git clone https://github.com/Cloufield/GWASTutorial.git

1.2 下载相关数据集

https://cloufield.github.io/GWASTutorial/01_Dataset/

cd GWASTutorial/01_Dataset
./download_sampledata.sh
# 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed
# 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim
# 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam

Phenotype Simulation

表型模拟

Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.
使用 GCTA 和 1KG EAS 数据集简单地模拟表型。
# 下载安装acta
wget https://yanglab.westlake.edu.cn/software/gcta/bin/gcta_1.94.0beta.zip
unzip gcta_1.94.0beta.zip
ln -s gcta_1.94.0beta/gcta64 /usr/local/bin/gcta
# 运行 GCTA
# 使用了GCTA软件的数据模拟功能,具体来说是在模拟一个病例对照研(Case-Control Study)​的GWAS数据
gcta  \
--bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing \
--simu-cc 250 254  \ # 250个病例(cases),254个对照(controls)
--simu-causal-loci causal.snplist  \ # 指定因果SNP的列表文件
--simu-hsq 0.8  \ # ​遗传力为0.8,即80%的表型变异由遗传因素解释
--simu-k 0.5  \ # 疾病流行率为50%
--simu-rep 1  \ # 只进行1次重复模拟
--out 1kgeas_binary
# 1kgeas_binary.log
# 1kgeas_binary.par
# 1kgeas_binary.phen

1.3 熟悉各种数据格式及其互相转换

https://cloufield.github.io/GWASTutorial/03_Data_formats/

在这里插入图片描述

2. Genotype Data QC 基因型数据质量控制

https://cloufield.github.io/GWASTutorial/04_Data_QC/

# 打开官网https://www.cog-genomics.org/plink/1.9/
# 安装plink2
wget https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20251019.zip
unzip plink2_linux_x86_64_20251019.zip
# 安装plink1
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20250819.zip
unzip plink_linux_x86_64_20250819.zip
The functions we will learn in this tutorial:
我们将在本教程中学习的功能:
Calculating missing rate (call rate)
计算缺失率(呼叫率)
Calculating allele Frequency
计算等位基因频率
Conducting Hardy-Weinberg equilibrium exact test
进行 Hardy-Weinberg 平衡精确检验
Applying filters 应用过滤器
Conducting LD-Pruning
进行LD修剪
Calculating inbreeding F coefficient
计算近亲繁殖 F 系数
Conducting sample & SNP filtering (extract/exclude/keep/remove)
进行样品和 SNP 过滤(提取/排除/保留/去除)
Estimating IBD / PI_HAT
估计 IBD / PI_HAT
Calculating LD 计算 LD
Data management (make-bed/recode)
数据管理(补床/重新编码)
执行以下代码完成QC来获得干净的数据集
```bash
#!/bin/bash
#$ -S /bin/bash
export OMP_NUM_THREADS=4
genotypeFile="../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing"
plink \
--bfile ${genotypeFile} \
--missing \ # 计算缺失率
--freq \ # 使用 PLINK1.9 计算变体的 MAF
--hardy \ # 使用 PLINK 计算 Hardy-Weinberg 平衡精确检验统计量
--out plink_results
plink2 \
--bfile ${genotypeFile} \
--freq \ # 使用 PLINK2 计算变异的替代等位基因频率
--out plink_results
plink \
--bfile ${genotypeFile} \
--maf 0.01 \ # 不含 MAF<0.01 的 SNP
--geno 0.01 \ # 过滤掉所有缺失率超过 0.02 的变体
--mind 0.02 \ # 过滤掉所有缺失率超过 0.02 的样本
--hwe 1e-6 \ # 过滤掉 Hardy-Weinberg 平衡精确检验 p 值低于所提供阈值的所有变体
--indep-pairwise 50 5 0.2 \ # 一个50个SNP的窗口,计算窗口中每对 SNP 之间的 LD,如果 LD 大于 0.2,则删除一对 SNP 中的一个;将窗口 5个 SNP 向前移动并重复该过程
--out plink_results
# Sample & SNP filtering (extract/exclude/keep/remove)
# 样品和 SNP 过滤(提取/排除/保留/去除)
plink \
--bfile ${genotypeFile} \
--extract plink_results.prune.in \ # 只使用通过LD筛选的独立SNP集合
--het \ # 计算每个样本的杂合度统计量
--out plink_results
# 使用 awk 创建具有极值 F 的个体样本列表
awk 'NR>1 && $6>0.1 || $6<-0.1 {print $1,$2}' plink_results.het > high_het.sample
  plink \
  --bfile ${genotypeFile} \
  --extract plink_results.prune.in \
  --genome \ #  执行核心计算。它会计算样本两两之间的IBS(状态同源)和IBD(血缘同源)统计量
  --out plink_results
  plink \
  --bfile ${genotypeFile} \
  --chr 22 \
  --r2 \ #  LD计算
  --out plink_results
  # 应用所有过滤器以获得干净的数据集
  plink \
  --bfile ${genotypeFile} \ # 指定输入的二进制基因型文件(前缀为 $genotypeFile的 .bed, .bim, .fam文件)
  --geno 0.02 \ # SNP缺失率过滤​:剔除在所有样本中缺失率高于2%的SNP位点
  --mind 0.02 \ # 样本缺失率过滤​:剔除基因型缺失率高于2%的样本
  --hwe 1e-6 \ # 哈迪-温伯格平衡过滤​:剔除显著偏离哈迪-温伯格平衡的SNP(p值 < 1e-6),这通常是基因分型错误的指标
  --remove high_het.sample \ # 移除异常样本​:根据文件 high_het.sample中的列表移除杂合度异常的样本(这些样本可能受污染或存在近亲繁殖)
  --keep-allele-order \ # ​保持等位基因顺序​:防止PLINK自动重新排列等位基因,确保REF和ALT等位基因的顺序与原始文件一致,这对于后续分析整合至关重要
  --make-bed \ # 生成二进制文件​:将质控后的数据输出为新的二进制格式文件(.bed, .bim, .fam),便于后续分析
  --out sample_data.clean # 输出前缀

3. 主成分分析 (PCA)

https://cloufield.github.io/GWASTutorial/05_PCA/

#!/bin/bash
plinkFile="../04_Data_QC/sample_data.clean"
outPrefix="plink_results"
threadnum=2
# 1. 准备阶段:排除高LD区域
plink \
--bfile ${plinkFile} \
--make-set high-ld-hg19.txt \ # 创建一个高LD区域SNP列表,排除那些已知的、LD结构异常复杂的基因组区域(如人类白细胞抗原HLA区域)
--write-set \
--out hild
# 2. LD剪枝:获取独立SNP集合
# pruning
plink2 \
--bfile ${plinkFile} \
--maf 0.01 \
--exclude hild.set \
--threads ${threadnum} \
--indep-pairwise 500 50 0.2 \
--out ${outPrefix}
# 3. 移除高亲缘关系样本
# remove related samples using knig-cuttoff
plink2 \
--bfile ${plinkFile} \
--extract ${outPrefix}.prune.in \
--king-cutoff 0.0884 \
--threads ${threadnum} \
--out ${outPrefix}
# 4. 主成分计算与投影
# 计算PCA​
# pca after pruning and removing related samples
plink2 \
--bfile ${plinkFile} \
--keep ${outPrefix}.king.cutoff.in.id \
--extract ${outPrefix}.prune.in \
--freq counts \
--threads ${threadnum} \
--pca approx allele-wts \
--out ${outPrefix}
# 投影到所有样本
# projection (related and unrelated samples)
plink2 \
--bfile ${plinkFile} \
--threads ${threadnum} \
--read-freq ${outPrefix}.acount \
--score ${outPrefix}.eigenvec.allele 2 6 header-read no-mean-imputation variance-standardize \
--score-col-nums 7-16 \
--out ${outPrefix}_projected
# ${outPrefix}.eigenval​:记录每个主成分所解释的方差大小。
# ​${outPrefix}.eigenvec​:记录每个样本在前几个主成分上的得分(坐标)。这个文件通常用于绘制PCA散点图,直观展示样本的群体结构。
# ​${outPrefix}_projected.sscore​:所有样本的投影后主成分得分文件。​这个文件就是后续GWAS分析中需要作为协变量输入的文件​。
  • 理解PCA在GWAS中的作用
  • 在GWAS中,​群体结构​ 是导致假阳性关联的主要原因之一。例如,如果某个SNP在病例组中频率高是因为该SNP在某个祖先成分占比较高的亚群中本身频率就高,而该亚群恰好因其他因素(如环境、文化)有更高的患病率,就会产生虚假关联。PCA产生的主成分可以有效量化样本的祖先背景差异。在后续进行基因型-表型关联分析时(例如使用逻辑回归或混合线性模型),将前几个主成分作为协变量纳入模型,可以校正掉这些祖先差异造成的虚假关联,大大提升结果的可靠性

3.1 绘制PCA图

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')
# 导入数据
pca = pd.read_table("/mnt/d/01_study/04_GWAS教程/GWASTutorial/05_PCA/plink_results_projected.sscore",sep="\t")
print(pca.shape)
pca.head()
(500, 14)
#FIDIIDALLELE_CTNAMED_ALLELE_DOSAGE_SUMPC1_AVGPC2_AVGPC3_AVGPC4_AVGPC5_AVGPC6_AVGPC7_AVGPC8_AVGPC9_AVGPC10_AVG
0HG00403HG00403390256390256-0.0029030.024865-0.010041-0.0095760.0069420.002221-0.008219-0.0011300.003352-0.004364
1HG00404HG004043906963906960.0001410.027965-0.0253890.005825-0.002747-0.006574-0.0113720.0077710.015978-0.017931
2HG00406HG00406388524388524-0.0070740.0315450.0043700.001262-0.0114910.0054030.0061770.004497-0.0008460.002172
3HG00407HG00407388808388808-0.0068400.0250730.006527-0.006797-0.0116050.010242-0.0139870.0061770.013896-0.008327
4HG00409HG00409391646391646-0.0003990.0290330.0189350.0013600.029043-0.0094150.017134-0.0130410.025343-0.023163
ped = pd.read_table("/mnt/d/01_study/04_GWAS教程/GWASTutorial/01_Dataset/integrated_call_samples_v3.20130502.ALL.panel",sep="\t")
print(pca.shape)
ped.head()
(500, 14)
samplepopsuper_popgenderUnnamed: 4Unnamed: 5
0HG00096GBREURmaleNaNNaN
1HG00097GBREURfemaleNaNNaN
2HG00099GBREURfemaleNaNNaN
3HG00100GBREURfemaleNaNNaN
4HG00101GBREURmaleNaNNaN
# 将PCA结果和人群信息进行合并
pcaped=pd.merge(pca,ped,right_on="sample",left_on="IID",how="inner")
print(pcaped.shape)
pcaped.head()
(500, 20)
#FIDIIDALLELE_CTNAMED_ALLELE_DOSAGE_SUMPC1_AVGPC2_AVGPC3_AVGPC4_AVGPC5_AVGPC6_AVGPC7_AVGPC8_AVGPC9_AVGPC10_AVGsamplepopsuper_popgenderUnnamed: 4Unnamed: 5
0HG00403HG00403390256390256-0.0029030.024865-0.010041-0.0095760.0069420.002221-0.008219-0.0011300.003352-0.004364HG00403CHSEASmaleNaNNaN
1HG00404HG004043906963906960.0001410.027965-0.0253890.005825-0.002747-0.006574-0.0113720.0077710.015978-0.017931HG00404CHSEASfemaleNaNNaN
2HG00406HG00406388524388524-0.0070740.0315450.0043700.001262-0.0114910.0054030.0061770.004497-0.0008460.002172HG00406CHSEASmaleNaNNaN
3HG00407HG00407388808388808-0.0068400.0250730.006527-0.006797-0.0116050.010242-0.0139870.0061770.013896-0.008327HG00407CHSEASfemaleNaNNaN
4HG00409HG00409391646391646-0.0003990.0290330.0189350.0013600.029043-0.0094150.017134-0.0130410.025343-0.023163HG00409CHSEASmaleNaNNaN
# 绘制图像
plt.figure(figsize=(4,4))
sns.scatterplot(data=pcaped,x="PC1_AVG",y="PC2_AVG",hue="pop",s=50)

在这里插入图片描述

4. 对pre-GWAS data利用plink2进行GWAS分析

#!/bin/bash
export OMP_NUM_THREADS=1
genotypeFile="../04_Data_QC/sample_data.clean"  # 经过质控的基因型数据
phenotypeFile="../01_Dataset/1kgeas_binary.txt" # 表型文件
covariateFile="../05_PCA/plink_results_projected.sscore" # 协变量文件(通常包含PCA结果)
covariateCols=6-10 # 指定使用协变量文件中的第6到10列(通常是前几个主成分)
colName="B1" # 指定要分析的表型在表型文件中的列名
threadnum=2
plink2 \
--bfile ${genotypeFile} \
--pheno ${phenotypeFile} \
--pheno-name ${colName} \
--maf 0.01 \ # 过滤掉次要等位基因频率低于0.01的罕见SNP
--covar ${covariateFile} \
--covar-col-nums ${covariateCols} \
--glm hide-covar firth firth-residualize single-prec-cc \ # 执行广义线性模型关联分析。
--threads ${threadnum} \
--out 1kgeas
# hide-covar: 在结果中不输出协变量自身的统计量,使结果文件更简洁。
# firth​: 核心方法​:使用Firth逻辑回归,这对于处理病例-对照不平衡或罕见变异非常有效,能减少假阳性。
# firth-residualize: 在Firth回归中对协变量进行残差化处理,可能提高数值稳定性。
# single-prec-cc: 为病例-对照分析使用单一精度计算,可能是在计算效率和精度间的一种权衡

5. 利用gwaslab进行GWAS可视化

# 安装gwaslab
conda env create -n gwaslab -c conda-forge python=3.12
conda activate gwaslab
pip install gwaslab
import gwaslab as gl
sumstats = gl.Sumstats("/mnt/d/01_study/04_GWAS教程/GWASTutorial/06_Association_tests/1kgeas.B1.glm.firth",fmt="plink2")#
2025/11/01 16:27:03 GWASLab v3.6.9 https://cloufield.github.io/gwaslab/
2025/11/01 16:27:03 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com
2025/11/01 16:27:03 Python version: 3.12.12 | packaged by conda-forge | (main, Oct 22 2025, 23:25:55) [GCC 14.3.0]
2025/11/01 16:27:03 Start to load format from formatbook....
2025/11/01 16:27:03  -plink2 format meta info:
2025/11/01 16:27:03   - format_name  : PLINK2 .glm.firth, .glm.logistic,.glm.linear
2025/11/01 16:27:03   - format_source  : https://www.cog-genomics.org/plink/2.0/formats
2025/11/01 16:27:03   - format_version  : Alpha 3.3 final (3 Jun)
2025/11/01 16:27:03   - last_check_date  :  20220806
2025/11/01 16:27:03  -plink2 to gwaslab format dictionary:
2025/11/01 16:27:03   - plink2 keys: ID,#CHROM,POS,REF,ALT,A1,OBS_CT,A1_FREQ,BETA,LOG(OR)_SE,SE,T_STAT,Z_STAT,P,LOG10_P,MACH_R2,OR
2025/11/01 16:27:03   - gwaslab values: SNPID,CHR,POS,REF,ALT,EA,N,EAF,BETA,SE,SE,T,Z,P,MLOG10P,INFO,OR
2025/11/01 16:27:04 Start to initialize gl.Sumstats from file :/mnt/d/01_study/04_GWAS教程/GWASTutorial/06_Association_tests/1kgeas.B1.glm.firth
2025/11/01 16:27:05  -Reading columns          : OBS_CT,ALT,REF,#CHROM,POS,A1,LOG(OR)_SE,A1_FREQ,P,OR,ID,Z_STAT
2025/11/01 16:27:05  -Renaming columns to      : N,ALT,REF,CHR,POS,EA,SE,EAF,P,OR,SNPID,Z
2025/11/01 16:27:05  -Current Dataframe shape : 1128732  x  12
2025/11/01 16:27:05  -Initiating a status column: STATUS ...
2025/11/01 16:27:05  #WARNING! Version of genomic coordinates is unknown...
2025/11/01 16:27:06  NEA not available: assigning REF to NEA...
2025/11/01 16:27:06  -EA,REF and ALT columns are available: assigning NEA...
2025/11/01 16:27:06  -For variants with EA == ALT : assigning REF to NEA ...
2025/11/01 16:27:06  -For variants with EA != ALT : assigning ALT to NEA ...
2025/11/01 16:27:06 Start to reorder the columns...v3.6.9
2025/11/01 16:27:06  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 108.34 MB
2025/11/01 16:27:06  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,SE,Z,P,OR,N,STATUS,REF,ALT
2025/11/01 16:27:06 Finished reordering the columns.
2025/11/01 16:27:06  -Trying to convert datatype for CHR: string -> Int64...Int64
2025/11/01 16:27:06  -Column  : SNPID  CHR   POS   EA       NEA      EAF     SE      Z       P       OR      N     STATUS   REF      ALT
2025/11/01 16:27:06  -DType   : object Int64 int64 category category float64 float64 float64 float64 float64 int64 category category category
2025/11/01 16:27:06  -Verified: T      T     T     T        T        T       T       T       T       T       T     T        T        T
2025/11/01 16:27:06  -Current Dataframe memory usage: 109.42 MB
2025/11/01 16:27:06 Finished loading data successfully!
2025/11/01 16:27:06 Path component detected: ['140008085564128']
2025/11/01 16:27:06 Creating path: ./140008085564128
sumstats.data
SNPIDCHRPOSEANEAEAFSEZPORNSTATUSREFALT
01:15774:G:A115774AG0.0282830.394259-0.7435150.4571700.7459204959999999GA
11:15777:A:G115777GA0.0737370.250121-0.6987900.4846830.8396404959999999AG
21:57292:C:T157292TC0.1046750.2152780.4471060.6547991.1010404929999999CT
31:77874:G:A177874AG0.0191530.4627500.2492850.8031401.1222704969999999GA
41:87360:C:T187360TC0.0231390.4395321.1715700.2413711.6735404979999999CT
.............................................
112872722:51217954:G:A2251217954AG0.0331990.362168-0.9954760.3195050.6973074979999999GA
112872822:51218377:G:C2251218377CG0.0333330.362212-0.9944340.3200120.6975394959999999GC
112872922:51218615:T:A2251218615AT0.0332660.362476-1.0292000.3033850.6886244969999999TA
112873022:51222100:G:T2251222100TG0.0391570.3231770.6178310.5366871.2210004989999999GT
112873122:51239678:G:T2251239678TG0.0341370.3543980.5786120.5628511.2276004989999999GT

1128732 rows × 14 columns

sumstats.get_lead(sig_level=5e-8)
2025/11/01 16:28:49 Start to extract lead variants...v3.6.9
2025/11/01 16:28:49  -Current Dataframe shape : 1128732 x 14 ; Memory usage: 109.42 MB
2025/11/01 16:28:49  -Processing 1128732 variants...
2025/11/01 16:28:49  -Significance threshold : 5e-08
2025/11/01 16:28:49  -Sliding window size: 500  kb
2025/11/01 16:28:49  -Using P for extracting lead variants...
2025/11/01 16:28:49  -Found 63 significant variants in total...
2025/11/01 16:28:49  -Identified 4 lead variants!
2025/11/01 16:28:49 Finished extracting lead variants.
SNPIDCHRPOSEANEAEAFSEZPORNSTATUSREFALT
549041:167562605:G:A1167562605AG0.3914810.1596457.694641.418880e-143.4157904939999999GA
1132002:55587876:T:C255587876CT0.3138830.167300-8.028949.831720e-160.2609984979999999TC
5497057:134326056:G:T7134326056TG0.1348090.2320187.104131.210840e-125.1980504979999999GT
108875020:42758834:T:C2042758834TC0.2272730.184323-7.769057.907810e-150.2388284959999999TC
sumstats.plot_mqq(skip=2,anno=True)
2025/11/01 16:29:19 Start to create MQQ plot...v3.6.9:
2025/11/01 16:29:19  -Genomic coordinates version: 99...
2025/11/01 16:29:19  #WARNING! Genomic coordinates version is unknown.
2025/11/01 16:29:19  -Genome-wide significance level to plot is set to 5e-08 ...
2025/11/01 16:29:19  -Raw input contains 1128732 variants...
2025/11/01 16:29:19  -MQQ plot layout mode is : mqq
2025/11/01 16:29:19 Finished loading specified columns from the sumstats.
2025/11/01 16:29:19 Start data conversion and sanity check:
2025/11/01 16:29:20  -Removed 0 variants with nan in CHR or POS column ...
2025/11/01 16:29:20  -Removed 0 variants with CHR <=0...
2025/11/01 16:29:20  -Removed 0 variants with nan in P column ...
2025/11/01 16:29:20  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
2025/11/01 16:29:20  -Sumstats P values are being converted to -log10(P)...
2025/11/01 16:29:20  -Sanity check: 0 na/inf/-inf variants will be removed...
2025/11/01 16:29:20  -Converting data above cut line...
2025/11/01 16:29:20  -Maximum -log10(P) value is 15.007370498326086 .
2025/11/01 16:29:20 Finished data conversion and sanity check.
2025/11/01 16:29:20 Start to create MQQ plot with 8511 variants...
2025/11/01 16:29:20  -Creating background plot...
2025/11/01 16:29:20 Finished creating MQQ plot successfully!
2025/11/01 16:29:20 Start to extract variants for annotation...
2025/11/01 16:29:20  -Found 4 significant variants with a sliding window size of 500 kb...
2025/11/01 16:29:20 Finished extracting variants for annotation...
2025/11/01 16:29:20 Start to process figure arts.
2025/11/01 16:29:20  -Processing X ticks...
2025/11/01 16:29:20  -Processing X labels...
2025/11/01 16:29:20  -Processing Y labels...
2025/11/01 16:29:20  -Processing Y tick lables...
2025/11/01 16:29:20  -Processing Y labels...
2025/11/01 16:29:20  -Processing lines...
2025/11/01 16:29:20 Finished processing figure arts.
2025/11/01 16:29:20 Start to annotate variants...
2025/11/01 16:29:20  -Annotating using column CHR:POS...
2025/11/01 16:29:20  -Adjusting text positions with repel_force=0.03...
2025/11/01 16:29:20 Finished annotating variants.
2025/11/01 16:29:20 Start to create QQ plot with 8511 variants:
2025/11/01 16:29:20  -Plotting all variants...
2025/11/01 16:29:20  -Expected range of P: (0,1.0)
2025/11/01 16:29:20  -Lambda GC (MLOG10P mode) at 0.5 is   0.99002
2025/11/01 16:29:20  -Processing Y tick lables...
2025/11/01 16:29:20 Finished creating QQ plot successfully!
2025/11/01 16:29:20 Start to save figure...
2025/11/01 16:29:20  -Skip saving figure!
2025/11/01 16:29:20 Finished saving figure...
2025/11/01 16:29:20 Finished creating plot successfully!

在这里插入图片描述

posted on 2025-11-27 15:51  ljbguanli  阅读(0)  评论(0)    收藏  举报