Zero-Shot Prediction with the CLIP Model
CLIP (Contrastive Language-Image Pretraining) is a multimodal model developed by OpenAI that processes images and text jointly. Its core idea is to use contrastive learning to map images and their corresponding text descriptions into a shared embedding space. This lets the model capture the relationship between images and text and perform well across a range of tasks, such as zero-shot image classification, image-text retrieval, and guiding image generation models.
See also:
- Blog: CLIP: Connecting text and images | OpenAI
- Paper: Learning Transferable Visual Models From Natural Language Supervision
- Code: CLIP | GitHub
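During pretraining, CLIP maximizes the cosine similarity of matched image-text pairs in a batch while minimizing it for all mismatched pairs, implemented as a symmetric cross-entropy over the batch's similarity matrix. Below is a minimal sketch of that objective using random placeholder features; the batch size and embedding dimension are illustrative assumptions, not values taken from the paper.

import torch as th
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive objective.
# Random features stand in for the image/text encoder outputs,
# which in CLIP are L2-normalized before comparison.
batch_size, dim = 8, 512                                 # hypothetical values
image_features = F.normalize(th.randn(batch_size, dim), dim=-1)
text_features = F.normalize(th.randn(batch_size, dim), dim=-1)

logit_scale = th.tensor(100.0)                           # CLIP learns this as exp(a parameter)
logits = logit_scale * image_features @ text_features.T  # (batch, batch) similarity matrix

labels = th.arange(batch_size)                           # matching pairs sit on the diagonal
loss = (F.cross_entropy(logits, labels)                  # image -> text
        + F.cross_entropy(logits.T, labels)) / 2         # text -> image
print(loss.item())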
pip install torch --extra-index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/openai/CLIP.git
pip install pillow
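Once installed, you can list the pretrained weights the clip package ships with via clip.available_models():

import clip
print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']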
import torch as th
import clip
from PIL import Image
device = th.device("cuda" if th.cuda.is_available() else "cpu")
model, preprocess = clip.load("ViT-B/32", device=device)  # load the CLIP model and its preprocessing pipeline
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # preprocess the image and add a batch dimension
# Define the candidate texts to compare against the image
texts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
tokens = clip.tokenize(texts).to(device)
with th.no_grad():
    image_features = model.encode_image(image)  # extract image features
    text_features = model.encode_text(tokens)   # extract text features

# Normalize so that dot products equal cosine similarities
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Turn the cosine similarities into a probability distribution
similarity = (image_features @ text_features.T).squeeze(0).softmax(dim=-1)
print("Similarity scores:")
for text, score in zip(texts, similarity):
print(f"'{text}': {100 * score.item():.0f}%")
# Pick the text with the highest score
best_match_idx = similarity.argmax().item()
print(f"The image is most similar to: '{texts[best_match_idx]}'")
$ python main.py
Similarity scores:
'a photo of a dog': 33%
'a photo of a cat': 35%
'a photo of a bird': 32%
The image is most similar to: 'a photo of a cat'
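The probabilities above are nearly uniform because raw cosine similarities lie in [-1, 1], so the softmax barely separates them, although the correct class still wins. CLIP's own forward pass multiplies the similarities by a learned logit scale (roughly 100 in the released models), which yields a much sharper distribution. Continuing from the script above, a sketch using that interface:

with th.no_grad():
    # model(image, text) returns scaled similarity logits for both directions
    logits_per_image, logits_per_text = model(image, tokens)
    probs = logits_per_image.softmax(dim=-1)  # sharper than softmax over raw cosines
print(probs)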
