Zero-Shot Prediction with the CLIP Model

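CLIP (Contrastive Language-Image Pre-training) is a multimodal model developed by OpenAI that processes images and text jointly. Its core idea is to use contrastive learning to map an image and its matching text description into the same feature space, so that matching pairs lie close together and mismatched pairs lie far apart. This lets the model relate images to text and perform well on tasks such as zero-shot image classification and image-text retrieval; its embeddings are also commonly used to guide image generation models.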
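
The contrastive idea can be illustrated with a minimal sketch in plain Python. It computes a symmetric cross-entropy loss over a small image-text similarity matrix, which is how CLIP-style training is usually described. The function names, the `scale` value, and the two similarity matrices are assumptions made up for illustration; this is not the actual training code.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the true class."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(sim, scale=100.0):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j, and the
    matching pairs sit on the diagonal. `scale` is an assumed temperature factor.
    """
    n = len(sim)
    # Image-to-text direction: each row should select its own column.
    i2t = sum(cross_entropy([scale * s for s in sim[i]], i) for i in range(n)) / n
    # Text-to-image direction: each column should select its own row.
    t2i = sum(cross_entropy([scale * sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return (i2t + t2i) / 2

# Hypothetical similarity matrices for a batch of three image-text pairs.
aligned = [[0.30, 0.10, 0.05],
           [0.08, 0.28, 0.06],
           [0.04, 0.07, 0.31]]
# Same values, but the first two images are most similar to each other's text.
shuffled = [[0.10, 0.30, 0.05],
            [0.28, 0.08, 0.06],
            [0.04, 0.07, 0.31]]

print(f"aligned:  {contrastive_loss(aligned):.4f}")
print(f"shuffled: {contrastive_loss(shuffled):.4f}")
```

The loss is close to zero when every image is most similar to its own text and grows large when pairs are mismatched, which is the pressure that pulls matching pairs together in the shared feature space.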


pip install torch --extra-index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/openai/CLIP.git
pip install pillow

import torch as th
import clip
from PIL import Image

device = th.device("cuda" if th.cuda.is_available() else "cpu")

model, preprocess = clip.load("ViT-B/32", device=device)           # load the CLIP model and its preprocessing transform
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # preprocess the image and add a batch dimension

# Candidate text prompts to compare against the image
texts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
tokens = clip.tokenize(texts).to(device)

with th.no_grad():
    image_features = model.encode_image(image)  # extract image features
    text_features = model.encode_text(tokens)   # extract text features

# Normalize to unit length so that the dot product equals cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

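# Cosine similarity between the image and each text, then softmax over the texts.
# No logit scale is applied here, so the resulting probabilities stay close to uniform.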
similarity = (image_features @ text_features.T).squeeze(0).softmax(dim=-1)
print("Similarity scores:")
for text, score in zip(texts, similarity):
    print(f"'{text}': {100 * score.item():.0f}%")

# Pick the text with the highest score
best_match_idx = similarity.argmax().item()
print(f"The image is most similar to: '{texts[best_match_idx]}'")

$ python main.py
Similarity scores:
'a photo of a dog': 33%
'a photo of a cat': 35%
'a photo of a bird': 32%
The image is most similar to: 'a photo of a cat'
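
The three scores above are nearly uniform because the softmax is applied directly to cosine similarities, which differ only by small amounts. The following plain-Python sketch shows how multiplying the similarities by a scale factor before the softmax sharpens the distribution. The cosine values are hypothetical, chosen only for illustration and not produced by the script above, and the factor of 100 is an assumption modeled on the example in the openai/CLIP README.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for the dog / cat / bird prompts.
# Illustrative values only, not numbers produced by the script above.
cosine = [0.21, 0.27, 0.19]

unscaled = softmax(cosine)                     # softmax on raw cosine similarities
scaled = softmax([100.0 * c for c in cosine])  # assumed logit scale of 100

for name, probs in (("unscaled", unscaled), ("scaled", scaled)):
    print(name, [f"{100 * p:.1f}%" for p in probs])
```

The ranking is the same in both cases; only the confidence changes. In the script above, the equivalent change would be to multiply `image_features @ text_features.T` by 100.0 before calling `softmax`.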
posted @ 2025-06-11 23:37  Undefined443