Chinese Description | English
ModelScope | Demo | Paper | Blog
This project is the Chinese version of the CLIP model, trained on large-scale Chinese data (~200 million image-text pairs). It aims to help users quickly carry out image/text feature extraction and similarity computation, cross-modal retrieval, and zero-shot image classification in the Chinese domain. The code is based on the open_clip project and has been adapted to Chinese data to achieve better results on Chinese benchmarks. The project provides an API, training code, and evaluation code, which are described in detail below.
Chinese-CLIP is currently open-sourced in 5 model sizes. Their details and download links are listed in the following table:
| Model | Download | # Params | Vision backbone | # Vision params | Text backbone | # Text params | Resolution |
|---|---|---|---|---|---|---|---|
| CN-CLIP RN50 | Download | 77M | ResNet50 | 38M | RBT3 | 39M | 224 |
| CN-CLIP ViT-B/16 | Download | 188M | ViT-B/16 | 86M | RoBERTa-wwm-Base | 102M | 224 |
| CN-CLIP ViT-L/14 | Download | 406M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 224 |
| CN-CLIP ViT-L/14@336px | Download | 407M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 336 |
| CN-CLIP ViT-H/14 | Download | 958M | ViT-H/14 | 632M | RoBERTa-wwm-Large | 326M | 224 |
For image-text retrieval, we ran zero-shot and finetuned experiments on MUGE Retrieval, Flickr30K-CN, and COCO-CN. For zero-shot image classification, we ran experiments on 10 datasets from ELEVATER. The results are shown in the tables below. Due to space limitations, we only list the results of the baseline models and of the best-performing Chinese-CLIP model size here; for detailed results of every Chinese-CLIP model size, please see Results.md.
MUGE Text-to-Image Retrieval (Official Validation Set):
| Setup | Zero-shot | | | | Finetune | | | |
|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
Flickr30K-CN Retrieval (Official Test Set):
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| Taiyi | 60.8 | 85.0 | 91.0 | - | - | - | - | - | - | - | - | - |
| R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval (Official Test Set):
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| Taiyi | 60.0 | 84.0 | 93.3 | - | - | - | - | - | - | - | - | - |
| R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification:
| Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
Before getting started, please make sure your environment meets the project's requirements. Then run the following command to install the third-party libraries this project depends on:

```bash
pip install -r requirements.txt
```

Below is a simple code example showing how to use the Chinese-CLIP API. Before you start, please install cn_clip:
```bash
# Install via pip
pip install cn_clip
# Or install from source
cd Chinese-CLIP
pip install -e .
```

After installation, you can call the API as follows: pass in the specified image (example) and texts, extract the image/text feature vectors, and compute their similarity:
```python
import torch
from PIL import Image

import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features; please use the normalized features for downstream tasks
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
```

We also provide support for deploying ONNX and TensorRT models to accelerate inference. For details, please refer to deployment.md.
If the API alone does not meet your needs, please read on to learn how to use this project to train and evaluate CLIP models.

The following covers a cross-modal retrieval tutorial (finetuning, inference, KNN computation, etc.) and a zero-shot image classification tutorial.
After downloading this project, please create a folder ${DATAPATH} to store datasets, pretrained checkpoints, and the logs & checkpoints produced by finetuning. The recommended workspace directory structure is as follows:
```
Chinese-CLIP/
├── run_scripts/
│   ├── muge_finetune_vit-b-16_rbt-base.sh
│   ├── flickr30k_finetune_vit-b-16_rbt-base.sh
│   └── ...               # more finetune or evaluation scripts...
└── cn_clip/
    ├── clip/
    ├── eval/
    ├── preprocess/
    └── training/

${DATAPATH}
├── pretrained_weights/
├── experiments/
├── deploy/               # for storing ONNX & TensorRT deployment models
└── datasets/
    ├── MUGE/
    ├── Flickr30k-CN/
    └── .../              # more custom datasets...
```
Here we explain how to download the pretrained model weights and how to preprocess the data before finetuning.

Please refer to the model table in the section above to download the corresponding checkpoint. We recommend storing the downloaded checkpoint files under the ${DATAPATH}/pretrained_weights/ directory.
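For example, after downloading the ViT-B/16 checkpoint (clip_cn_vit-b-16.pt, the file name used in later commands in this document), it can be placed as follows; the local download path below is only an illustration:

```bash
# Create the weights directory and move a downloaded checkpoint into it.
# The source path ~/Downloads/... is hypothetical; use wherever your browser saved the file.
mkdir -p ${DATAPATH}/pretrained_weights
mv ~/Downloads/clip_cn_vit-b-16.pt ${DATAPATH}/pretrained_weights/
```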
To fit the Chinese-CLIP code and ensure efficient data processing and loading, we recommend organizing the image-text datasets used for training and evaluation as follows:
```
${DATAPATH}
└── datasets/
    └── ${dataset_name}/
        ├── train_imgs.tsv      # image id & image content
        ├── train_texts.jsonl   # text id & text content, plus the list of matched image ids
        ├── valid_imgs.tsv
        ├── valid_texts.jsonl
        ├── test_imgs.tsv
        └── test_texts.jsonl
```
where ${dataset_name} is the name of the dataset (e.g. MUGE).

To keep file handling efficient, we do not store the images as a large number of small files. Instead, the training/validation/test images are stored as base64 strings in the corresponding ${split}_imgs.tsv file. Each line represents one image and contains the image id (int) and the base64-encoded image, separated by a tab, in the following format:

```
1000002 /9j/4AAQSkZJ...YQj7314oA//2Q==
```
Converting an original image file to base64 is simple; just run the following Python code:
```python
from PIL import Image
from io import BytesIO
import base64

img = Image.open(file_name)  # path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
base64_str = base64_str.decode("utf-8")   # str
```

The texts and the image-text matching relations are stored in the ${split}_texts.jsonl file. Each line of the file is a JSON object in the following format:
{"text_id": 8428, "text": "高级感托特包斜挎", "image_ids": [1076345, 517602]}
For test sets where only the texts are given and the ground-truth image-text matching is unknown, the image_ids field of each line can simply be an empty list, i.e. "image_ids": [].
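Putting the two formats together, here is a minimal sketch (with hypothetical image paths and captions) of writing a ${split}_imgs.tsv file and the matching ${split}_texts.jsonl file:

```python
import base64
import json
from io import BytesIO

from PIL import Image

def image_to_base64(file_name):
    # Re-encode an image file as a base64 string, as in the snippet above.
    img = Image.open(file_name)
    buffer = BytesIO()
    img.save(buffer, format=img.format)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Hypothetical samples: (image id, image path) pairs and text records with matched image ids.
images = [(1000002, "raw_images/product_1000002.jpg")]
texts = [{"text_id": 8428, "text": "高级感托特包斜挎", "image_ids": [1000002]}]

with open("train_imgs.tsv", "w", encoding="utf-8") as f:
    for image_id, path in images:
        f.write(f"{image_id}\t{image_to_base64(path)}\n")

with open("train_texts.jsonl", "w", encoding="utf-8") as f:
    for record in texts:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```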
Finally, we need to serialize the tsv and jsonl files into memory-indexed LMDB databases, which makes random access during training more efficient:
```bash
python cn_clip/preprocess/build_lmdb_dataset.py \
    --data_dir ${DATAPATH}/datasets/${dataset_name} \
    --splits train,valid,test
```
For example, for the MUGE dataset, ${dataset_name} is set to MUGE, and --splits specifies the dataset splits to convert, separated by commas without spaces (a concrete MUGE command is shown after the listing below). After conversion, the following LMDB files will be added to the dataset folder:
```
${DATAPATH}
└── datasets/
    └── ${dataset_name}/
        └── lmdb/
            ├── train
            │   ├── imgs
            │   └── pairs
            ├── valid
            └── test
```
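For instance, for the MUGE data prepared above, the conversion command is simply the one given earlier with ${dataset_name} replaced by MUGE:

```bash
# Build LMDB files for all three MUGE splits
python cn_clip/preprocess/build_lmdb_dataset.py \
    --data_dir ${DATAPATH}/datasets/MUGE \
    --splits train,valid,test
```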
To lower the barrier to entry, we also provide archives of the MUGE data (download link) and Flickr30K-CN data (download link) that have already been preprocessed with the steps above. Just download, decompress, and place them under the ${DATAPATH}/datasets/ directory. If you need the COCO-CN data, please apply for permission from the original authors and then contact us by email.
This section describes the training procedure, both to help users understand the model details and to show how to finetune the Chinese-CLIP pretrained models we provide. For the two downstream retrieval datasets MUGE and Flickr30K-CN, we provide the sample training scripts run_scripts/muge_finetune_vit-b-16_rbt-base.sh and run_scripts/flickr30k_finetune_vit-b-16_rbt-base.sh. The scripts support both single-machine (single- or multi-GPU) and multi-machine distributed training. Before running, fill in the distributed configuration following the guidelines and comments at the beginning of the script, then run the following commands to start training (for multi-machine training, run the command on every machine). If GPU memory is insufficient, consider enabling the gradient checkpointing (recomputation) option described below. The logs and model checkpoints produced by training are saved automatically in the directory specified by the user:
```bash
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
```

The relevant training configuration options include:
- WORKER_CNT: number of machines used for training.
- GPUS_PER_NODE: number of GPUs on each machine.
- train-data: LMDB directory of the training data; see the preprocessing section above for how to prepare the LMDB files.
- val-data: LMDB directory of the validation data. If set to None, no validation is performed during training.
- num-workers: number of DataLoader worker processes for the training set, default 4.
- valid-num-workers: number of DataLoader worker processes for the validation set (if validation is enabled), default 1.
- vision-model: the vision backbone, chosen from ["ViT-B-16", "ViT-L-14", "ViT-L-14-336", "ViT-H-14", "RN50"].
- text-model: the text backbone, chosen from ["RoBERTa-wwm-ext-base-chinese", "RoBERTa-wwm-ext-large-chinese", "RBT3-chinese"].
- context-length: length of the text input sequence.
- warmup: number of warmup steps.
- batch-size: per-GPU batch size during training. (Please make sure the total number of training samples is larger than batch-size * number of GPUs, so there is at least one training batch.)
- lr: learning rate.
- wd: weight decay.
- max-steps: number of training steps. The number of training epochs can alternatively be specified via max-epochs.
- freeze-vision: whether to freeze the vision backbone.
- use-augment: whether to use AutoAugment for image data augmentation.
- valid-batch-size: per-machine batch size during validation. (Please make sure the total number of validation samples is larger than valid-batch-size * number of GPUs, so there is at least one validation batch.)
- valid-step-interval and valid-epoch-interval: validation frequency in steps/epochs. If set to -1, no validation is performed during training.
- grad-checkpointing: use gradient checkpointing (recomputation), which does not store intermediate activations during the forward pass, trading training time for a smaller memory footprint; useful when GPU memory is insufficient. (A store_true argument: just add --grad-checkpointing to the script; currently requires Pytorch > 1.8.0.)
- mask-ratio: following the FLIP strategy, randomly mask the given proportion of image patches during finetuning to reduce memory usage and speed up training. Default 0.0, which disables the strategy.
- use-flash-attention: use FlashAttention, which significantly speeds up Chinese-CLIP finetuning and reduces memory usage without affecting the results. (A store_true argument: after setting up the environment, add --use-flash-attention to the script; see flash_attention.md for details.)
- accum-freq: gradient accumulation frequency, default 1. When set to an integer greater than 1, gradient accumulation for contrastive learning is enabled to simulate a larger batch size. With a per-GPU batch size of m, the total batch size is accum_freq * m * number of GPUs.
- gather-with-grad: whether to gather features with full gradients during distributed training, off by default.
- name: name of the output directory. The hyperparameter log, training log, and output checkpoints are stored under ${DATAPATH}/experiments/${name}/.
- save-step-frequency and save-epoch-frequency: checkpoint saving interval in steps or epochs.
- report-training-batch-acc: whether the log reports the in-batch image-to-text & text-to-image training accuracy.
- resume: path of the checkpoint to load. The sample script sets this to the pretrained checkpoint path; it can also be set to one of your own finetuned checkpoints to resume training.
- reset-data-offset: whether to resume reading data from the previous breakpoint. If the batch size or the number of GPUs has changed, it is recommended to enable this option.
- reset-optimizer: whether to use the optimizer state.

After training, the log is automatically stored at ${DATAPATH}/experiments/${name}/out_${timestamp}.log. The training log format is as follows:
```
2022-12-11,20:40:34 | INFO | Rank 0 | Global Steps: 1/735 | Train Epoch: 1 [1024/250880 (0%)] | Loss: 2.371020 | Image2Text Acc: 49.90 | Text2Image Acc: 48.73 | Data Time: 1.039s | Batch Time: 3.625s | LR: 0.000000 | logit_scale: 4.605 | Global Batch Size: 1024
```

The validation log format is as follows:

```
2022-12-11,20:42:47 | INFO | Rank 0 | Validation Result (epoch 1 @ 150 steps) | Valid Loss: 0.502810 | Image2Text Acc: 84.95 | Text2Image Acc: 84.26 | logit_scale: 4.605 | Valid Batch Size: 128
```
Note: training convergence and stability are correlated with the total batch size. If you use a smaller total batch size than the default configuration (128 per GPU * 8 GPUs), we recommend using a smaller learning rate. Using more GPUs and a larger total batch size generally gives better results.
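As a quick sanity check when tuning these options, the effective (global) batch size follows the accum-freq formula given above; a small worked example with illustrative values:

```python
# Effective batch size = accum_freq * per-GPU batch size * total number of GPUs.
accum_freq = 1            # --accum-freq
per_gpu_batch_size = 128  # --batch-size
worker_cnt = 1            # WORKER_CNT (number of machines)
gpus_per_node = 8         # GPUS_PER_NODE
global_batch_size = accum_freq * per_gpu_batch_size * worker_cnt * gpus_per_node
print(global_batch_size)  # 1024, matching the "Global Batch Size" field in the training log above
```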
The procedure for feature extraction and image-text retrieval evaluation is as follows.

Currently the code supports image/text feature extraction on a single GPU; see the commands below. We also provide ONNX and TensorRT deployment to accelerate feature inference, see deployment.md for details.
```bash
cd Chinese-CLIP/
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip

split=valid # compute features for the valid or test split
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt

python -u cn_clip/eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
    --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
    --img-batch-size=32 \
    --text-batch-size=32 \
    --context-length=52 \
    --resume=${resume} \
    --vision-model=ViT-B-16 \
    --text-model=RoBERTa-wwm-ext-base-chinese
```

By default, the extracted features are saved under the ${DATAPATH}/datasets/${dataset_name} directory. Image features are saved in ${split}_imgs.img_feat.jsonl; each line stores the features of one image as JSON, in the following format:
{"image_id": 1000002, "feature": [0.0198, ..., -0.017, 0.0248]}
Text features are saved in ${split}_texts.txt_feat.jsonl , with the format as follows:
{"text_id": 248816, "feature": [0.1314, ..., 0.0018, -0.0002]}
For small-scale academic retrieval datasets, we provide a simple KNN search implementation to compute top-k recall for text-to-image and image-to-text retrieval. (Tip: if you want to build a retrieval demo on top of this project, we recommend extracting image and text features with the Chinese-CLIP model and combining them with the open-source clip-retrieval framework to build the front-end service.)

For text-to-image retrieval (recalling relevant images for each text), run the following command:
```bash
cd Chinese-CLIP/
split=valid # compute predictions for the valid or test split
python -u cn_clip/eval/make_topk_predictions.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
```

The results are saved to the specified jsonl file. Each line contains the top-k image ids recalled for one text, in the following format:
{ "text_id" : 153915 , "image_ids" : [ 5791244 , 1009692167 , 7454547004 , 3564007203 , 38130571 , 2525270674 , 2195419145 , 2503091968 , 4966265765 , 3690431163 ]}For image-to-text search (image recall related text), similarly, run the following command:
For image-to-text retrieval (recalling relevant texts for each image), run the analogous command:

```bash
split=valid # compute predictions for the valid or test split
python -u cn_clip/eval/make_topk_predictions_tr.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
```

Each line of the output contains the top-k text ids recalled for one image, in the following format:
{ "image_id" : 977856234 , "text_ids" : [ 156914 , 157914 , 158914 , 155914 , 156179 , 158907 , 157179 , 154179 , 154914 , 154723 ]}We provide the evaluation script to calculate the Recall@1/5/10 of the search task, and give the mean recall (the average of Recall@1/5/10). Run the following command to get the score:
For text-to-image retrieval, run:
```bash
split=valid # evaluate the valid or test split
python cn_clip/eval/evaluation.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
    output.json
cat output.json
```

For image-to-text retrieval, first run the following command to convert the ground-truth annotation jsonl from the text-to-image format to the image-to-text format:
```bash
python cn_clip/eval/transform_ir_annotation_to_tr.py \
    --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
```

After the conversion completes, run:
```bash
split=valid # evaluate the valid or test split
python cn_clip/eval/evaluation_tr.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
    output.json
cat output.json
```

The printed result has the following format:
{ "success" : true , "score" : 85.67 , "scoreJson" : { "score" : 85.67 , "mean_recall" : 85.67 , "r1" : 71.2 , "r5" : 90.5 , "r10" : 95.3 }}Regarding the training and testing process of cross-modal retrieval, we take the MUGE search dataset (Multimodal E-commerce Graphics and Text Challenge) as an example, and also provides a Jupyter Notebook (download link) that includes all the above processes and can be run. Everyone is welcome to practice it.
This section describes how to use Chinese-CLIP for zero-shot image classification, taking the datasets of the ELEVATER benchmark as an example. ELEVATER is an evaluation suite composed of several well-known classification datasets (including CIFAR-10, CIFAR-100, MNIST, etc.) that measures a model's zero-shot performance on these datasets. For our experiments, we prepared Chinese prompts, Chinese category labels, and the original images for each dataset (see the data documentation for details) so that the Chinese-CLIP models can be evaluated. For more details about this benchmark, please follow the link. You can also follow the same procedure to prepare and evaluate your own Chinese classification dataset.
First, prepare the data in the following format. Since zero-shot image classification only requires inference, you only need to prepare the test set and the pretrained model weights, stored under the user-specified ${DATAPATH} with the following directory structure:
```
${DATAPATH}
├── pretrained_weights/
└── datasets/
    └── ${dataset_name}/
        ├── label_cn.txt
        └── test/
            ├── 000/                # label id; if there are more than 10 labels, left-pad with zeros to 3 digits to keep lexicographic order
            │   ├── image_0003.jpg  # image samples; no special naming requirements
            │   ├── image_0005.jpg
            │   └── ...
            ├── 001/
            │   ├── image_0001.jpg
            │   ├── image_0002.jpg
            │   └── ...
            └── 002/
                ├── image_0003.jpg
                ├── image_0005.jpg
                └── ...
            ...
```
Make sure the data in the test folder is split into subfolders by label id, and that the ids are in lexicographic order (when there are more than 10 labels, left-pad the ids with zeros, i.e. label.zfill(3), e.g. 001, 002, etc.). label_cn.txt contains the label names, one per line, for example:
```
手风琴
飞机
锚
...
```
The label id of each line is its line number minus 1: the label on the first line has id 0, the label on the second line has id 1, and so on. If the total number of labels is greater than 10, the ids are left-padded with zeros to 3 digits; for example, with 100 labels the label ids are 000-099. You need to create a folder for each label id and put the samples with that label into it. We take the CIFAR-100 dataset from ELEVATER as an example; please follow the link to download the preprocessed data. If you want to evaluate Chinese-CLIP on the other datasets included in ELEVATER, see our data documentation.
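To arrange your own classification test set into this layout, a minimal sketch (with hypothetical label names and image paths) might look like this:

```python
import os
import shutil

# Hypothetical labels and (image path, label id) samples.
labels = ["手风琴", "飞机", "锚"]
samples = [("raw_photos/accordion_01.jpg", 0), ("raw_photos/airplane_07.jpg", 1)]

dataset_dir = os.path.join("datasets", "my-dataset")
os.makedirs(dataset_dir, exist_ok=True)

# One label name per line; the label id is the line number minus 1.
with open(os.path.join(dataset_dir, "label_cn.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(labels) + "\n")

# As described above, left-pad label ids with zeros to 3 digits when there are more than 10 labels.
pad = 3 if len(labels) > 10 else 1
for path, label_id in samples:
    target = os.path.join(dataset_dir, "test", str(label_id).zfill(pad))
    os.makedirs(target, exist_ok=True)
    shutil.copy(path, target)
```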
We provide a prediction script, run_scripts/zeroshot_eval.sh. An example command is:

```bash
bash run_scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} ${dataset_name} \
    ${vision_model} ${text_model} \
    ${ckpt_path} ${index_file}
```

The meanings of the parameters are:
- 0 is the GPU id.
- DATAPATH: see the data preparation section above; fill in the actual path.
- dataset_name: see the data preparation section above; the directory name of the dataset to evaluate, e.g. cifar-100.
- vision_model: the vision backbone, chosen from ["ViT-B-32", "ViT-B-16", "ViT-L-14", "ViT-L-14-336", "RN50", "ViT-H-14"].
- text_model: chosen from ["RoBERTa-wwm-ext-base-chinese", "RoBERTa-wwm-ext-large-chinese", "RBT3-chinese"].
- ckpt_path: the full path of the pretrained checkpoint.
- index_file: optional, only needed when submitting to the official ELEVATER evaluation; see the data documentation.

For example, to evaluate the ViT-B/16 pretrained model on CIFAR-100, run (replace ${DATAPATH} with the actual path):
```bash
bash run_scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} cifar-100 \
    ViT-B-16 RoBERTa-wwm-ext-base-chinese \
    ${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt
```

The command prints the top-1 accuracy:
```
Result:
zeroshot-top1: 0.6444
```
On CIFAR-100, the ViT-B/16 Chinese-CLIP model is expected to reach about 64.4% top-1 accuracy. For our zero-shot classification results with other model sizes and on other datasets, please see Results.md.
The program also saves a json file for submission to the official ELEVATER evaluation. Its content looks like this:

```
{"model_name": "CN-CLIP-ViT-B-16", "dataset_name": "cifar-100", "num_trainable_params": 0, "num_params": 188262913, "num_visual_params": 86192640, "num_backbone_params": 188262913, "n_shot": 0, "rnd_seeds": [123], "predictions": "prediction probability tensor [size: (1, 10000, 100)]"}
```

It contains model meta-information such as the model name model_name, the dataset name dataset_name, the total number of parameters num_params, and the number of vision-tower parameters num_visual_params, as well as the model output, i.e. the predicted probability tensor of size [1, number of samples, number of labels].
Based on our feature extraction API integrated into Huggingface transformers, we also provide an online demo (Hosted inference API) on the Huggingface Model Hub where you can try zero-shot image classification directly. See the demo links for each model size. You are welcome to try it!
If you find this project helpful, please give us a star and share it with people around you. Citations of the related work are also welcome; thank you for your support!
```
@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
```