Biological Sequence Language Models

Certainly! Here’s a concise table showing, for each model, whether it takes a protein sequence as input and whether it is primarily used to generate text (a natural-language description) or embeddings (for clustering, prediction, etc.).


Model Name                                    Input: Sequence    Output: Embedding    Output: Text
ProtT5-XL-UniRef50                            ✅                 ✅                   ❌
ProtT5-XL-BFD                                 ✅                 ✅                   ❌
ESM-1b (esm1b_t33_650M_UR50S)                 ✅                 ✅                   ❌
ESM-2 (esm2_t12_35M_UR50D)                    ✅                 ✅                   ❌
ESM-2 (esm2_t33_650M_UR50D)                   ✅                 ✅                   ❌
Prot2Text-Base-v1-1                           ✅                 ❌                   ✅
BiomedNLP-BiomedBERT-base-uncased-abstract    ❌                 ✅*                  ✅*

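* BiomedBERT: the input is biomedical text, not a protein sequence, so it is not used for protein sequence tasks. Its output is an embedding or, since it is an encoder-only BERT model, masked-token predictions over biomedical text rather than free-form generated text.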


Summary:

  • For protein sequence → embedding: use any ESM or ProtT5 model.
  • For protein sequence → text: use Prot2Text-Base-v1-1.
  • BiomedBERT is only for biomedical language, not protein sequence.
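
As a rough illustration of the "protein sequence → embedding" route, here is a minimal sketch using the Hugging Face transformers library with the smallest ESM-2 checkpoint from the table. The repository id `facebook/esm2_t12_35M_UR50D`, the helper names `mean_pool_residues` and `embed_sequence`, and the choice of mean pooling over residues are assumptions made for this example, not something the table prescribes.

```python
from typing import List, Sequence


def mean_pool_residues(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors, skipping the first (<cls>) and last (<eos>) rows."""
    residues = token_vectors[1:-1]
    if not residues:
        raise ValueError("need at least one residue between <cls> and <eos>")
    dim = len(residues[0])
    return [sum(row[i] for row in residues) / len(residues) for i in range(dim)]


def embed_sequence(sequence: str,
                   model_name: str = "facebook/esm2_t12_35M_UR50D") -> List[float]:
    """Return one fixed-size embedding for a protein sequence (downloads the model)."""
    # Imported lazily so the pooling helper above works without torch installed.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmModel.from_pretrained(model_name)
    model.eval()

    # The ESM tokenizer wraps the residues in <cls> ... <eos>.
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (tokens, hidden_dim)
    return mean_pool_residues(hidden.tolist())
```

Calling `embed_sequence("MKTAYIAKQR")` would return a single vector for the whole protein. Mean pooling is only one common way to get a per-protein embedding; per-residue tasks such as active site identification would keep the token-level vectors instead.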

Let me know if you want usage code for a particular model!

Is ESM-2 better?

In the MMSite framework (according to the NeurIPS 2024 paper you uploaded), ESM-1b outperforms ESM-2-650M for the active site identification task. This is directly shown in their benchmark results:


Performance Table (extracted from Table 1 and Table 2 in the paper):

Model         Fmax      AUPRC     MCC       OS        FPR
ESM-1b        0.7052    0.8452    0.7123    0.7211    0.2758
ESM-2-650M    0.6517    0.8230    0.6596    0.6652    0.3240

And after multi-modal enhancement (with text), both improve, but ESM-1b still has better results:

Model         + MMSite (Fmax)
ESM-1b        0.8250
ESM-2-650M    0.7965
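
To make the size of the improvement concrete, the Fmax gain from MMSite can be worked out from the two tables above. The snippet below is only arithmetic on the quoted numbers; the variable and function names are made up for this example.

```python
# Fmax values copied from the two tables above.
baseline_fmax = {"ESM-1b": 0.7052, "ESM-2-650M": 0.6517}
mmsite_fmax = {"ESM-1b": 0.8250, "ESM-2-650M": 0.7965}


def fmax_gain(model: str) -> float:
    """Absolute Fmax improvement after adding MMSite, rounded to 4 decimals."""
    return round(mmsite_fmax[model] - baseline_fmax[model], 4)


gains = {model: fmax_gain(model) for model in baseline_fmax}

for model, gain in gains.items():
    print(f"{model}: {baseline_fmax[model]:.4f} -> {mmsite_fmax[model]:.4f} (+{gain:.4f})")
```

This prints `ESM-1b: 0.7052 -> 0.8250 (+0.1198)` and `ESM-2-650M: 0.6517 -> 0.7965 (+0.1448)`. On these numbers ESM-2-650M gains slightly more from MMSite, but it starts lower and still ends below ESM-1b.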

Conclusion:

In this paper, ESM-1b is consistently better than ESM-2-650M in their framework for active site prediction.

This holds for every metric reported in the tables above: ESM-1b has the higher Fmax, AUPRC, MCC, and OS, and the lower FPR.
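
The "all metrics" claim can be checked mechanically against the baseline table. In this sketch the direction of each metric is an assumption: Fmax, AUPRC, MCC, and OS are treated as higher-is-better, and FPR (read here as false positive rate) as lower-is-better. The names `HIGHER_IS_BETTER`, `scores`, and `better_model` are illustrative.

```python
# Assumed metric directions: True means a higher value is better.
HIGHER_IS_BETTER = {"Fmax": True, "AUPRC": True, "MCC": True, "OS": True, "FPR": False}

# Baseline scores copied from the performance table above.
scores = {
    "ESM-1b": {"Fmax": 0.7052, "AUPRC": 0.8452, "MCC": 0.7123, "OS": 0.7211, "FPR": 0.2758},
    "ESM-2-650M": {"Fmax": 0.6517, "AUPRC": 0.8230, "MCC": 0.6596, "OS": 0.6652, "FPR": 0.3240},
}


def better_model(metric: str) -> str:
    """Name of the model with the better value for one metric."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(scores, key=lambda model: scores[model][metric])


winners = {metric: better_model(metric) for metric in HIGHER_IS_BETTER}
print(winners)
```

Under those assumptions every entry in `winners` is `ESM-1b`, including FPR, where the smaller value wins.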

Reason:
The authors do not specifically analyze why, but this may be due to:

  • Task-specific compatibility (ESM-1b may better fit their dataset or loss formulation)
  • Differences in pretraining or architecture, such as ESM-2 using rotary position embeddings where ESM-1b uses learned ones (ESM-2-650M is newer and sometimes outperforms on other tasks, but not here)

Bottom line:

If you follow MMSite, ESM-1b is the preferred model for sequence embeddings.

Let me know if you need implementation advice or want to know when ESM-2 might be better!

posted @ 2025-06-09 19:12  GraphL