LightDock | 蛋白质-多肽对接 | peptide-protein docking
SOX9因为IDR domain太多,完整的结构基本无法预测,所以从多肽SLiM着手,分段预测,分而治之。
用的是LightDock这个对接工具,非常好用,速度也很快,50个多肽和1个蛋白的对接,时间可以控制在半天之内。
https://lightdock.org/tutorials/0.9.3/basics
最终的汇总PPT,Downloads/SOX9-B1 docking/SOX9-B1 interaction.pptx
关于BAF的PyMol操作
select histones, chain A+G+D+B+C+E+F+H select histones, chain G+H color grey, histones select B1, chain M color red, B1 select dna, resn DA+DT+DG+DC color pink, dna select R, chain R
关于选中B1的contact residues
color red, B1 # 在PyMol里,如何select某些链上没有被其他结构occupied的残基 select not_contacted, (chain M and polymer.protein) and not (byres (chain M and polymer.protein within 5 of (not chain M))) color white, not_contacted select contacted, (chain M and polymer.protein) and (byres (chain M and polymer.protein within 4 of (not chain M))) color white, contacted save B1.cryoEM.fasta, B1, format=fasta save B1.contacted.fasta, contacted, format=fasta # for repaired B1 color red, repaired_B1 select contacted, (repaired_B1 and polymer.protein) and (byres (repaired_B1 and polymer.protein within 3 of (not repaired_B1 and not B1))) color white, contacted save repaired_B1.contacted.fasta, contacted, format=fasta
做MSA找到其fasta序列
# online multiple sequence alignment
# goto: https://www.genome.jp/tools-bin/clustalw
# SLOW/ACCURATE
# Gap Open Penalty: 1
其他选择contact residues的尝试
color red, B1 select surface_residues, B1 and b > 8 select exposed_protein, surface_residues and polymer.protein color white, exposed_protein # 如何选择那些能够与配体binding的残基 select binding_residues, byres (chain M and polymer.protein within 5 of organic) color blue, binding_residues # 如何选择那些能够与潜在蛋白互作的残基 # 如何选择那些没有暴露出来,不可能与蛋白互作的残基 cmd.get_area("B1", load_b=1) select buried_residues, B1 and polymer.protein and b < 10 color white, buried_residues
show surface color red, alpha_B1 color green, B1 select d1, alpha_B1 and (resi 34-47 or resi 173-232 or resi 276-312 or resi 340-352 or resi 366-385) color white, d1 save BAF_B1_contact.pse
把lightdock的swarm与B1 align【最终还是手动做的】
# ----------- # lightdock # swarm cd /Users/zhixinli/Partners HealthCare Dropbox/Zhixin Li/Downloads/SOX9-B1 docking load swarm_centers_B1_1131.pdb, swarms load B1_repaired_model_03.pdb, B1 show surface, B1 color red, B1 # swarm 不需要 align,直接可视化即可 show spheres, swarm set sphere_scale, 0.5, swarm color yellow, swarm center B1 zoom all # 获取 receptor 的中心 get_position
docking的最终准备,B1蛋白处理
## docking前的蛋白预处理 # PDB上的蛋白质,在docking之前,该做哪些预处理,我发现蛋白序列并不完整。 # 1. 检查和清理结构,如何删除水分子,去除配体、小分子和离子? remove solvent remove resn HOH remove hetatm remove resn SO4 remove resn CL remove resn CA remove resn MG remove resn ZN remove resn NA remove resn K save clean_B1.pdb # 2. 处理不完整的氨基酸序列 # https://swissmodel.expasy.org/interactive cd /Users/zhixinli/Partners HealthCare Dropbox/Zhixin Li/Downloads/SOX9-B1 docking load clean_B1.pdb, clean_B1 load B1_repaired_model_03.pdb, repaired_B1 color red, repaired_B1 super clean_B1, repaired_B1 align clean_B1, repaired_B1 save raw_repaired_B1.align.pse # 3. 加氢原子 # chimera your_protein.pdb # Tools → Structure Editing → AddH # File → Save PDB → 命名为 protein_with_H.pdb # ---------------- cd ~/projects/SOX9_BAF/ module load localcolabfold/1.5.2 gcc/9.2.0 colabfold_batch SOX9_SLiM.filtered.fasta SOX9_SLiM_fast_5m --num-recycle 12 --num-models 5 --msa-mode single_sequence --use-gpu-relax --amber zip -r SOX9.SLiM.pdb.5m.zip pdb_5m # rank_001: 排名第一,即最优模型 # alphafold2_ptm_model_5: 使用的模型是 model_5(基于 pTM 评估) # seed_000: 随机种子编号,影响结构采样但通常只在多次运行比较时用到
对接
# ---------------- conda create -n docking_py310 python=3.10 conda activate docking_py310 pip install lightdock which lightdock3_setup.py /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3_setup.py # GPT: lightdock3_setup.py 核心参数解析 # GPT: 我在PyMol里选中了部分残基,如何导出成restraints.txt文件? # 限定可以docking的残基 # sublime处理重复 # repaired_B1.contacted.BAF.pse iterate (contacted), print(f'M.{resn}.{resi}') # 基本参数 # 设定对接点数量; # 去掉氢和水分子; # 指定残基限制; # 初步探索结构空间 # lightdock3_setup.py B1_repaired_model_03.pdb $p -s 100 -g 200 --noh --now -r restraints.txt # /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3_setup.py B1_repaired_model_03.pdb pdb/ADTSGVPS_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb -s 100 -g 200 --noxt --noh --now -r restraints.txt /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3_setup.py B1_repaired_model_03.pdb pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb --noxt --noh --now -r restraints.txt /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3_setup.py B1_repaired_model_03.pdb $p --noxt --noh --now -r restraints.txt # /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3.py setup.json 100 -c 1 -l 0 # 正式 docking:50-200 步,根据配体大小和系统复杂度决定 # sipper or ddna /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3.py setup.json 20 -s sipper
后续处理Pipeline
for d in swarm_*; do for f in "$d"/gso_*.out; do awk '!/^#/ {print FILENAME "\t" $(NF) "\t" NR-1 "\t" $0}' "$f" >> all_scores.tmp done done # sort -k4 -g all_scores.tmp | head -1 > best_result.txt # awk '{print $(NF), $0}' all_scores.tmp | sort -k1 -g | head -10 > best_result.txt sort -k2 -g all_scores.tmp | head -10 > ${p}_best_result_top10.txt /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lgd_generate_conformations.py B1_repaired_model_03.pdb pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb swarm_1055/gso_20.out 42 /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lgd_generate_conformations.py B1_repaired_model_03.pdb pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb swarm_1056/gso_10.out 77 /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lgd_generate_conformations.py B1_repaired_model_03.pdb pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb swarm_494/gso_10.out 97 # clean for rerun rm -r swarm_* setup.json init/ lightdock* pdb_5m/lightdock_* all_scores.tmp
这次学习高度依赖GPT,基本舍弃了Google
# 我如何根据这些结果来获取当前的docking可信度打分 # 这么多swarm结果,已经这么多gso输出,我如何选出最优的那个对比 # gso每一行是一个docking吗 # 我如何获取某一行最佳的那个打分的docking结构 # 为什我生成了1000个swarm文件夹 # 我想docking 多肽和蛋白,应该用哪个打分函数? # 如何修改代码,输出每个swarm的最佳得分呢 # 为什么gso_0.out里的最优得分都是0? # 提取或可视化这个结构的置信度(plddt)分布 # 可以直接在PyMol里可视化pLDDT吗 # 可以在PyMol中可视化LightDock生成的Swarms吗? # 我有swarm的pdb文件,该怎么align,与蛋白一起可视化 # 能解释一下,在每一个swarm处的glowworms,打分函数是根据什么来打分的吗,输入是什么? # 多肽对接蛋白,多肽一部分嵌入了蛋白内,这合理吗? 不合理!!!目前就sipper是合理的 # 但只有sipper 的对接,多肽没有嵌入蛋白,多肽附着在蛋白表面,为什么打分为0,有改进办法吗 # 默认的DEFAULT_CONTACT_RESTRAINTS_CUTOFF = 3.9,这适合我的情况吗 # alphafold生成的relaxed和unrelaxed,哪个模型好 # lightdock的打分有对不同的蛋白大小做normalization吗?不同大小的多肽对接得分可比吗? # 帮我写一个按原子数和接触残基数归一化的脚本
选取最佳打分函数
# test scoring funciton SCORES=(cpydock dfire2 mj3h pisa sd sipper tobi vdw) LOGFILE="timing_stat.log" echo -e "ScoringFunction\tSetupTime(s)\tSimulationTime(s)" > $LOGFILE for SCORE in "${SCORES[@]}" do # clean previous files rm lightdock* pdb_5m/lightdock_* DIR="dock_${SCORE}" mkdir -p "$DIR" cd "$DIR" /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3_setup.py ../B1_repaired_model_03.pdb ../pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb --noxt --noh --now -r ../restraints.txt echo "[$SCORE] Running simulation..." start_sim=$(date +%s) /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lightdock3.py setup.json 20 -s $SCORE end_sim=$(date +%s) sim_time=$((end_sim - start_sim)) echo -e "$SCORE\t$setup_time\t$sim_time" | tee -a ../$LOGFILE for d in swarm_*; do for f in "$d"/gso_*.out; do awk '!/^#/ {print FILENAME "\t" $(NF) "\t" NR-1 "\t" $0}' "$f" >> all_scores.tmp done done sort -k2 -g all_scores.tmp | head -1 > best_result_top10.txt read COL1 COL3 < <(awk 'NR==1 {print $1, $3}' best_result_top10.txt) /home/zz950/softwares/miniconda3/envs/docking_py310/bin/lgd_generate_conformations.py ../B1_repaired_model_03.pdb ../pdb_5m/DVISNIETFDVNEFD_relaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb $COL1 $COL3 cp swarm*/lightdock_$((COL3 - 1)).pdb ../${SCORE}_best.pdb cd .. done
-rw-r--r-- 1 zz950 zz950 176 Apr 30 15:41 dock_cpydock/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 176 Apr 30 16:53 dock_dfire2/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 169 Apr 30 17:24 dock_mj3h/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 167 Apr 30 17:47 dock_pisa/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 194 Apr 30 17:56 dock_sd/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 163 Apr 30 17:59 dock_sipper/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 168 Apr 30 18:38 dock_tobi/best_result_top10.txt -rw-r--r-- 1 zz950 zz950 174 Apr 30 18:47 dock_vdw/best_result_top10.txt swarm_1055/gso_10.out -1057.80198153 42 (-31.3301535, -18.6203510, 16.5903817, 0.4594921, -0.3323501, -0.6703852, -0.4785333) 0 0 -1577.07855309 0 4.200 -1057.80198153 swarm_998/gso_10.out -1041.64240828 144 (-21.2007974, -33.8906462, 21.0036514, 0.3935567, 0.5070643, 0.7040320, -0.3038713) 0 0 -1523.76322112 4 3.800 -1041.64240828 swarm_379/gso_10.out -27.36000000 30 (-26.8635348, -12.6567589, 18.2163714, -0.1305292, -0.8561228, -0.0141236, -0.4998165) 0 0 -24.08566743 4 3.640 -27.36000000 swarm_157/gso_20.out -0.83572176 111 (-2.6080655, 35.5076648, 5.7503667, 0.0168021, 0.1792593, 0.8345611, -0.5206645) 0 0 -0.62337494 5 4.840 -0.83572176 swarm_488/gso_10.out -2599291755.02440310 129 (-38.8236518, -15.4687036, 15.6944477, -0.2331482, -0.5301978, 0.8057096, 0.1239529) 0 0 -2892909889.30625200 2 3.240 -2599291755.02440310 swarm_0/gso_0.out 0.00000000 100 (45.9092041, 8.6613108, -0.6781290, 0.3481973, -0.4696026, 0.8106289, 0.0333577) 0 0 5.00000000 0 0.200 0.00000000 swarm_604/gso_10.out -30.26000000 79 (16.8878788, -16.1452200, -7.1088962, 0.8045971, 0.4575122, -0.1986185, 0.3222682) 0 0 -27.56280527 10 2.520 -30.26000000 swarm_1056/gso_10.out -428.29799035 112 (-33.7805647, -18.5711154, 18.4883215, 0.1383592, 0.1644035, -0.8605261, 0.4618690) 0 0 -556.67418304 4 3.720 -428.29799035
不同打分函数耗时
ScoringFunction SetupTime(s) SimulationTime(s) dfire2 4031 mj3h 1767 pisa 1226 sd 472 sipper 64 tobi 2238 vdw 462 cpydock 482 ddna 1162
一些bug
sipper打分为0,需要修改DEFAULT_CONTACT_RESTRAINTS_CUTOFF参数
vi /home/zz950/softwares/miniconda3/envs/docking_py310/lib/python3.10/site-packages/lightdock/constants.py
代码bug
File "/home/zz950/softwares/miniconda3/envs/docking_py310/lib/python3.10/site-packages/lightdock/scoring/sipper/driver.py", line 164, in __call__ DEFAULT_CONTACT_RESTRAINTS_CUTOFF NameError: name 'DEFAULT_CONTACT_RESTRAINTS_CUTOFF' is not defined lightdock/scoring/ddna/data/fort.21_xscore_noH_Met文件缺失 from lightdock.constants import DEFAULT_CONTACT_RESTRAINTS_CUTOFF
参考代码:
- /home/zz950/projects/BAF_SOX9/docking/lightdock
- /home/zhl595/projects/SOX9_BAF