CLIP Paper Reading, Zero-Shot Experiment, and Linear Probe Experiment Notes

Published: 2023/12/8


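These notes record a read-through of the CLIP paper, a zero-shot experiment (direct inference), and a linear probe experiment (freeze CLIP as a feature extractor and train only the classification layer).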

1. Paper Reading

Paper: Learning Transferable Visual Models From Natural Language Supervision
GitHub: https://github.com/openai/CLIP
Reference video: CLIP 論文逐段精讀【論文精讀】

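CLIP (Contrastive Language-Image Pre-training) is a 2021 work from OpenAI. Its goal is to train transferable visual models using text as the supervision signal. The model's principle is shown in the red box below: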

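Text Encoder: a Transformer with 12 layers, 8 attention heads, and 512-dimensional features; the tokenizer uses BPE (byte pair encoding);
Image Encoder: the experiments cover 5 ResNets (the larger ones scaled up EfficientNet-style) and 3 ViTs; the final choice is ViT-L/14@336px.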

The specific model hyperparameters are as follows:

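Training stage: the pre-training objective uses contrastive learning so that the model learns which text-image pairs match; in the model diagram above, the blue diagonal marks the matched image-text pairs. The training set is WIT (WebImageText), a dataset of 400 million image-text pairs that the authors collected themselves.
Inference stage: uses prompt engineering (for example, for cat-vs-dog binary image classification, feed in "A photo of cat" and "A photo of dog" separately, then compute each one's similarity with the image feature) and prompt ensembling (around 80 prompts were designed; for instance, "cat" and "dog" can each be wrapped in different prompts to build the input texts, whose features are then extracted and scored).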
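The prompt-ensembling idea can be sketched without loading CLIP itself. The following is a minimal NumPy illustration, not the paper's implementation: `encode_text` is a hypothetical callable that maps a list of strings to an array of shape [len, d], `toy_encode_text` is a stand-in that returns pseudo-random vectors, and the three templates are made up for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_zero_shot_classifier(class_names, templates, encode_text):
    """Average the normalized text embeddings of all templates for each class.

    encode_text is a hypothetical callable: list[str] -> array [len, d].
    Returns an array [num_classes, d] whose rows have unit length.
    """
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = l2_normalize(np.asarray(encode_text(prompts), dtype=np.float64))
        weights.append(l2_normalize(emb.mean(axis=0)))
    return np.stack(weights)

def classify(image_embedding, classifier):
    """Return the index of the class with the highest cosine similarity."""
    sims = classifier @ l2_normalize(image_embedding)
    return int(np.argmax(sims))

def toy_encode_text(prompts):
    """Stand-in encoder (NOT CLIP): one deterministic random vector per string."""
    out = []
    for p in prompts:
        seed = sum(ord(ch) for ch in p)
        out.append(np.random.default_rng(seed).normal(size=16))
    return np.stack(out)

# Illustrative templates only, not the ones used in the paper
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]
class_names = ["cat", "dog"]
classifier = build_zero_shot_classifier(class_names, templates, toy_encode_text)
print(classifier.shape)
```

With a real model, `encode_text` would wrap tokenization plus the text encoder, and `image_embedding` would come from the image encoder.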

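Below is pseudocode for the model's workflow:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - training images; n is the batch size, h/w/c are height/width/channels
# T[n, l] - training texts; n is the batch size, l is the text length
# W_i[d_i, d_e] - learnable projection matrix for image embeddings
# W_t[d_t, d_e] - learnable projection matrix for text embeddings
# t - learnable temperature parameter of the softmax

# extract the features I_f and T_f
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# multiply I_f and T_f by their projection matrices to map them into the same
# vector space, then normalize to get each modality's embedding. [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities between image and text embeddings. [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function (common in contrastive learning)
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2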
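The pseudocode can be turned into a small runnable NumPy sketch. This is an illustration under stated assumptions rather than the training code: the encoders are skipped and random arrays stand in for `I_f` and `T_f`, the helper `cross_entropy` is written here by hand, and the temperature is set to log(1/0.07) purely as an example value.

```python
import numpy as np

def l2_normalize(x, axis=1):
    """Scale each vector to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows; logits is [n, k], labels is [n]."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss for n matched image-text pairs."""
    I_e = l2_normalize(I_f @ W_i)            # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)            # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)       # [n, n] scaled cosine similarities
    labels = np.arange(logits.shape[0])      # pair i matches pair i (the diagonal)
    loss_rows = cross_entropy(logits, labels)    # each image against all texts
    loss_cols = cross_entropy(logits.T, labels)  # each text against all images
    return (loss_rows + loss_cols) / 2

# Random stand-ins for encoder outputs (assumed shapes, not real features)
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 4, 8, 6, 5
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)  # example temperature value

loss = clip_loss(I_f, T_f, W_i, W_t, t)
print(f"contrastive loss: {loss:.4f}")
```

When every embedding is identical the logits carry no information and the loss equals log(n); when matched pairs are perfectly aligned and mismatched pairs are orthogonal, the loss approaches 0.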

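2. Zero-Shot Inference Experiment

This part loads the pre-trained model weights directly and runs zero-shot inference.

Create a new project openai_clip and, following the GitHub README, install clip and its dependencies from source: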
pip install git+https://github.com/openai/CLIP.git
Manually download the model weights (ViT-B/32) to a local directory (the code below loads them from ckpt/):
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
Save the test image piano_dog.png (the code below reads it from ./dataset/), create zero-shot.ipynb, and run the code:
import torch
import clip
import os
import numpy as np
from PIL import Image

os.environ['CUDA_VISIBLE_DEVICES'] = '1'
device = "cuda" if torch.cuda.is_available() else "cpu"
clip.available_models()
Test image:

List of available model weights:
Load the model and inspect its information:
model, preprocess = clip.load("ckpt/ViT-B-32.pt", device=device)

input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)
Extract the image and text features and compute the similarity:
image = preprocess(Image.open("./dataset/piano_dog.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog eating an egg", "a dog singing a song", "a dog playing a piano"]).to(device)

with torch.no_grad():
    # Encode each modality separately to inspect the feature shapes
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    print("Image/text features:", image_features.shape, text_features.shape)

    # The forward pass returns scaled similarity logits in both directions
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    print("Image/text logits:", logits_per_image.shape, logits_per_text.shape, probs.shape)

print("Label probs:", np.around(probs, 3))
As the output shows, "a dog playing a piano" gets the highest probability.
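To make the step from features to `logits_per_image` and then to probabilities more concrete, here is a minimal NumPy sketch. It assumes the usual CLIP recipe (L2-normalize both feature sets, multiply the cosine similarities by a logit scale, then apply softmax); the scale of 100 and the 3-dimensional toy features are illustrative assumptions, not values read from the model above.

```python
import numpy as np

def similarity_to_probs(image_features, text_features, logit_scale=100.0):
    """image_features [m, d], text_features [k, d] -> probabilities [m, k]."""
    img = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T                   # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # softmax over the texts

# Toy features (made up): the image vector points closest to the third text vector
image_features = np.array([[0.9, 0.1, 0.4]])
text_features = np.array([
    [0.1, 1.0, 0.0],
    [0.0, 0.2, 1.0],
    [1.0, 0.1, 0.5],
])
probs = similarity_to_probs(image_features, text_features)
print(np.around(probs, 3))
```

Because of the large scale factor, even moderate differences in cosine similarity turn into a sharply peaked probability distribution.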

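3. Linear Probe Training Experiment

The paper mentions that if CLIP is frozen and used as a feature extractor, with a linear layer added on top for classification, then after training on 8-shot or more samples the result beats direct zero-shot inference. So let's try it in practice with sklearn.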

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda:4" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('./ckpt/ViT-B-32.pt', device)

# Load the CIFAR-100 dataset (root should contain the extracted cifar-100-python
# folder; with download=True it is fetched if missing)
root = os.path.expanduser("./dataset/cifar100/")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

# The model is used only to extract features
def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Build train_X, train_y, test_X, test_y for sklearn
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Initialize a logistic regression object
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate (np.float was removed in NumPy 1.24, so use the built-in float)
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")
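The script above fits the classifier on the full CIFAR-100 training set. To try the k-shot setting mentioned from the paper (for example 8-shot), one option is to subsample k examples per class before fitting. The helper below is a hypothetical sketch, not part of the original experiment; `sample_k_shot_indices` and the toy labels are introduced here only for illustration.

```python
import numpy as np

def sample_k_shot_indices(labels, k, seed=0):
    """Pick k random example indices per class from a 1-D label array."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # positions of this class
        if len(idx) < k:
            raise ValueError(f"class {cls} has only {len(idx)} examples, need {k}")
        chosen.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(chosen)

# Toy labels: 5 classes with 20 examples each (stand-in for train_labels)
toy_labels = np.repeat(np.arange(5), 20)
subset = sample_k_shot_indices(toy_labels, k=8)
print(len(subset), np.bincount(toy_labels[subset]))
```

With the arrays from the script above, the few-shot fit would then be `classifier.fit(train_features[subset], train_labels[subset])`, where `subset = sample_k_shot_indices(train_labels, k=8)`.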
