Zero-Shot Prediction with the CLIP Model
CLIP (Contrastive Language-Image Pretraining) is a multimodal model developed by OpenAI that processes images and text jointly. Its core idea is to use contrastive learning to map images and their corresponding text descriptions into a shared embedding space. This lets the model capture the relationship between images and text and perform well across a range of tasks, such as zero-shot image classification, image-text retrieval, and guiding image generation models.
See also:
- Blog: CLIP: Connecting text and images | OpenAI
- Paper: Learning Transferable Visual Models From Natural Language Supervision
- Code: CLIP | GitHub
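During pretraining, CLIP maximizes the cosine similarity of matched image-text pairs in a batch while minimizing it for all mismatched pairs, implemented as a symmetric cross-entropy over the batch's similarity matrix. Below is a minimal sketch of that objective using random placeholder features; the batch size and embedding dimension are illustrative assumptions, not values taken from the paper.

import torch as th
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive objective.
# Random features stand in for the image/text encoder outputs,
# which in CLIP are L2-normalized before comparison.
batch_size, dim = 8, 512                                 # hypothetical values
image_features = F.normalize(th.randn(batch_size, dim), dim=-1)
text_features = F.normalize(th.randn(batch_size, dim), dim=-1)

logit_scale = th.tensor(100.0)                           # CLIP learns this as exp(a parameter)
logits = logit_scale * image_features @ text_features.T  # (batch, batch) similarity matrix

labels = th.arange(batch_size)                           # matching pairs sit on the diagonal
loss = (F.cross_entropy(logits, labels)                  # image -> text
        + F.cross_entropy(logits.T, labels)) / 2         # text -> image
print(loss.item())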
pip install torch --extra-index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/openai/CLIP.git
pip install pillow
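Once installed, you can list the pretrained weights the clip package ships with via clip.available_models():

import clip
print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']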
import torch as th
import clip
from PIL import Image
device = th.device("cuda" if th.cuda.is_available() else "cpu")
model, preprocess = clip.load("ViT-B/32", device=device)  # load the CLIP model and its preprocessing pipeline
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # preprocess the image and add a batch dimension
# Define the candidate texts to compare against the image
texts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
tokens = clip.tokenize(texts).to(device)
with th.no_grad():
    image_features = model.encode_image(image)  # extract image features
    text_features = model.encode_text(tokens)   # extract text features

# Normalize so that dot products equal cosine similarities
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Turn the cosine similarities into a probability distribution
similarity = (image_features @ text_features.T).squeeze(0).softmax(dim=-1)
print("Similarity scores:")
for text, score in zip(texts, similarity):
print(f"'{text}': {100 * score.item():.0f}%")
# Pick the text with the highest score
best_match_idx = similarity.argmax().item()
print(f"The image is most similar to: '{texts[best_match_idx]}'")
$ python main.py
Similarity scores:
'a photo of a dog': 33%
'a photo of a cat': 35%
'a photo of a bird': 32%
The image is most similar to: 'a photo of a cat'
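The probabilities above are nearly uniform because raw cosine similarities lie in [-1, 1], so the softmax barely separates them, although the correct class still wins. CLIP's own forward pass multiplies the similarities by a learned logit scale (roughly 100 in the released models), which yields a much sharper distribution. Continuing from the script above, a sketch using that interface:

with th.no_grad():
    # model(image, text) returns scaled similarity logits for both directions
    logits_per_image, logits_per_text = model(image, tokens)
    probs = logits_per_image.softmax(dim=-1)  # sharper than softmax over raw cosines
print(probs)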
