Chinese Description | English
ModelScope | Demo | Paper | Blog
This project is the Chinese version of the CLIP model, trained on large-scale Chinese data (~200 million image-text pairs). It aims to help users quickly carry out image/text feature extraction and similarity computation, cross-modal retrieval, and zero-shot image classification in the Chinese domain. The code is based on the open_clip project and has been adapted to Chinese data to achieve better results on Chinese benchmarks. The project provides an API, training code, and evaluation code, which are described in detail below.
Chinese-CLIP is currently open-sourced in 5 model sizes. Their details and download links are listed in the following table:
| Model | Download | # Params | Vision backbone | # Vision params | Text backbone | # Text params | Resolution |
|---|---|---|---|---|---|---|---|
| CN-CLIP RN50 | Download | 77M | ResNet50 | 38M | RBT3 | 39M | 224 |
| CN-CLIP ViT-B/16 | Download | 188M | ViT-B/16 | 86M | RoBERTa-wwm-Base | 102M | 224 |
| CN-CLIP ViT-L/14 | Download | 406M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 224 |
| CN-CLIP ViT-L/14@336px | Download | 407M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 336 |
| CN-CLIP ViT-H/14 | Download | 958M | ViT-H/14 | 632M | RoBERTa-wwm-Large | 326M | 224 |
For image-text retrieval, we ran zero-shot and finetuned experiments on MUGE Retrieval, Flickr30K-CN, and COCO-CN. For zero-shot image classification, we ran experiments on 10 datasets from ELEVATER. The results are shown in the tables below. Due to space limitations, we only list the results of the baseline models and of the best-performing Chinese-CLIP model size here; for detailed results of every Chinese-CLIP model size, please see Results.md.
MUGE Text-to-Image Retrieval (Official Validation Set):
| Setup | Zero-shot | | | | Finetune | | | |
|---|---|---|---|---|---|---|---|---|
| Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |
Flickr30K-CN Retrieval (Official Test Set):
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| Taiyi | 60.8 | 85.0 | 91.0 | - | - | - | - | - | - | - | - | - |
| R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |
COCO-CN Retrieval (Official Test Set):
| Task | Text-to-Image | | | | | | Image-to-Text | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Setup | Zero-shot | | | Finetune | | | Zero-shot | | | Finetune | | |
| Metric | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| Taiyi | 60.0 | 84.0 | 93.3 | - | - | - | - | - | - | - | - | - |
| R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |
Zero-shot Image Classification:
| Task | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
|---|---|---|---|---|---|---|---|---|---|---|
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |
Before getting started, please make sure your environment meets the project's requirements. Then run the following command to install the third-party libraries this project depends on:

```bash
pip install -r requirements.txt
```

Below is a simple code example showing how to use the Chinese-CLIP API. Before you start, please install cn_clip:
```bash
# Install via pip
pip install cn_clip
# Or install from source
cd Chinese-CLIP
pip install -e .
```

After installation, you can call the API as follows: pass in the specified image (example) and texts, extract the image/text feature vectors, and compute their similarity:
```python
import torch
from PIL import Image

import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the features; please use the normalized features for downstream tasks
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    logits_per_image, logits_per_text = model.get_similarity(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # [[1.268734e-03 5.436878e-02 6.795761e-04 9.436829e-01]]
```

We also provide support for deploying ONNX and TensorRT models to accelerate inference. For details, please refer to deployment.md.
If the API alone does not meet your needs, please read on to learn how to use this project to train and evaluate CLIP models.

The following covers a cross-modal retrieval tutorial (finetuning, inference, KNN computation, etc.) and a zero-shot image classification tutorial.
After downloading this project, please create a folder ${DATAPATH} to store datasets, pretrained checkpoints, and the logs & checkpoints produced by finetuning. The recommended workspace directory structure is as follows:
```
Chinese-CLIP/
├── run_scripts/
│   ├── muge_finetune_vit-b-16_rbt-base.sh
│   ├── flickr30k_finetune_vit-b-16_rbt-base.sh
│   └── ...               # more finetune or evaluation scripts...
└── cn_clip/
    ├── clip/
    ├── eval/
    ├── preprocess/
    └── training/

${DATAPATH}
├── pretrained_weights/
├── experiments/
├── deploy/               # for storing ONNX & TensorRT deployment models
└── datasets/
    ├── MUGE/
    ├── Flickr30k-CN/
    └── .../              # more custom datasets...
```
Here we explain how to download the pretrained model weights and how to preprocess the data before finetuning.

Please refer to the model table in the section above to download the corresponding checkpoint. We recommend storing the downloaded checkpoint files under the ${DATAPATH}/pretrained_weights/ directory.
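For example, after downloading the ViT-B/16 checkpoint (clip_cn_vit-b-16.pt, the file name used in later commands in this document), it can be placed as follows; the local download path below is only an illustration:

```bash
# Create the weights directory and move a downloaded checkpoint into it.
# The source path ~/Downloads/... is hypothetical; use wherever your browser saved the file.
mkdir -p ${DATAPATH}/pretrained_weights
mv ~/Downloads/clip_cn_vit-b-16.pt ${DATAPATH}/pretrained_weights/
```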
To fit the Chinese-CLIP code and ensure efficient data processing and loading, we recommend organizing the image-text datasets used for training and evaluation as follows:
```
${DATAPATH}
└── datasets/
    └── ${dataset_name}/
        ├── train_imgs.tsv      # image id & image content
        ├── train_texts.jsonl   # text id & text content, plus the list of matched image ids
        ├── valid_imgs.tsv
        ├── valid_texts.jsonl
        ├── test_imgs.tsv
        └── test_texts.jsonl
```
where ${dataset_name} is the name of the dataset (e.g. MUGE).

To keep file handling efficient, we do not store the images as a large number of small files. Instead, the training/validation/test images are stored as base64 strings in the corresponding ${split}_imgs.tsv file. Each line represents one image and contains the image id (int) and the base64-encoded image, separated by a tab, in the following format:

```
1000002 /9j/4AAQSkZJ...YQj7314oA//2Q==
```
Converting an original image file to base64 is simple; just run the following Python code:
```python
from PIL import Image
from io import BytesIO
import base64

img = Image.open(file_name)  # path to the image file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data)  # bytes
base64_str = base64_str.decode("utf-8")   # str
```

The texts and the image-text matching relations are stored in the ${split}_texts.jsonl file. Each line of the file is a JSON object in the following format:
{"text_id": 8428, "text": "高级感托特包斜挎", "image_ids": [1076345, 517602]}
For test sets where only the texts are given and the ground-truth image-text matching is unknown, the image_ids field of each line can simply be an empty list, i.e. "image_ids": [].
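Putting the two formats together, here is a minimal sketch (with hypothetical image paths and captions) of writing a ${split}_imgs.tsv file and the matching ${split}_texts.jsonl file:

```python
import base64
import json
from io import BytesIO

from PIL import Image

def image_to_base64(file_name):
    # Re-encode an image file as a base64 string, as in the snippet above.
    img = Image.open(file_name)
    buffer = BytesIO()
    img.save(buffer, format=img.format)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

# Hypothetical samples: (image id, image path) pairs and text records with matched image ids.
images = [(1000002, "raw_images/product_1000002.jpg")]
texts = [{"text_id": 8428, "text": "高级感托特包斜挎", "image_ids": [1000002]}]

with open("train_imgs.tsv", "w", encoding="utf-8") as f:
    for image_id, path in images:
        f.write(f"{image_id}\t{image_to_base64(path)}\n")

with open("train_texts.jsonl", "w", encoding="utf-8") as f:
    for record in texts:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```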
Finally, we need to serialize the tsv and jsonl files into memory-indexed LMDB databases, which makes random access during training more efficient:
```bash
python cn_clip/preprocess/build_lmdb_dataset.py \
    --data_dir ${DATAPATH}/datasets/${dataset_name} \
    --splits train,valid,test
```
For example, for the MUGE dataset, ${dataset_name} is set to MUGE, and --splits specifies the dataset splits to convert, separated by commas without spaces (a concrete MUGE command is shown after the listing below). After conversion, the following LMDB files will be added to the dataset folder:
```
${DATAPATH}
└── datasets/
    └── ${dataset_name}/
        └── lmdb/
            ├── train
            │   ├── imgs
            │   └── pairs
            ├── valid
            └── test
```
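For instance, for the MUGE data prepared above, the conversion command is simply the one given earlier with ${dataset_name} replaced by MUGE:

```bash
# Build LMDB files for all three MUGE splits
python cn_clip/preprocess/build_lmdb_dataset.py \
    --data_dir ${DATAPATH}/datasets/MUGE \
    --splits train,valid,test
```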
To lower the barrier to entry, we also provide archives of the MUGE data (download link) and Flickr30K-CN data (download link) that have already been preprocessed with the steps above. Just download, decompress, and place them under the ${DATAPATH}/datasets/ directory. If you need the COCO-CN data, please apply for permission from the original authors and then contact us by email.
This section describes the training procedure, both to help users understand the model details and to show how to finetune the Chinese-CLIP pretrained models we provide. For the two downstream retrieval datasets MUGE and Flickr30K-CN, we provide the sample training scripts run_scripts/muge_finetune_vit-b-16_rbt-base.sh and run_scripts/flickr30k_finetune_vit-b-16_rbt-base.sh. The scripts support both single-machine (single- or multi-GPU) and multi-machine distributed training. Before running, fill in the distributed configuration following the guidelines and comments at the beginning of the script, then run the following commands to start training (for multi-machine training, run the command on every machine). If GPU memory is insufficient, consider enabling the gradient checkpointing (recomputation) option described below. The logs and model checkpoints produced by training are saved automatically in the directory specified by the user:
```bash
cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}
```

The relevant training configuration options include:
- WORKER_CNT: number of machines used for training.
- GPUS_PER_NODE: number of GPUs on each machine.
- train-data: LMDB directory of the training data; see the preprocessing section above for how to prepare the LMDB files.
- val-data: LMDB directory of the validation data. If set to None, no validation is performed during training.
- num-workers: number of DataLoader worker processes for the training set, default 4.
- valid-num-workers: number of DataLoader worker processes for the validation set (if validation is enabled), default 1.
- vision-model: the vision backbone, chosen from ["ViT-B-16", "ViT-L-14", "ViT-L-14-336", "ViT-H-14", "RN50"].
- text-model: the text backbone, chosen from ["RoBERTa-wwm-ext-base-chinese", "RoBERTa-wwm-ext-large-chinese", "RBT3-chinese"].
- context-length: length of the text input sequence.
- warmup: number of warmup steps.
- batch-size: per-GPU batch size during training. (Please make sure the total number of training samples is larger than batch-size * number of GPUs, so there is at least one training batch.)
- lr: learning rate.
- wd: weight decay.
- max-steps: number of training steps. The number of training epochs can alternatively be specified via max-epochs.
- freeze-vision: whether to freeze the vision backbone.
- use-augment: whether to use AutoAugment for image data augmentation.
- valid-batch-size: per-machine batch size during validation. (Please make sure the total number of validation samples is larger than valid-batch-size * number of GPUs, so there is at least one validation batch.)
- valid-step-interval and valid-epoch-interval: validation frequency in steps/epochs. If set to -1, no validation is performed during training.
- grad-checkpointing: use gradient checkpointing (recomputation), which does not store intermediate activations during the forward pass, trading training time for a smaller memory footprint; useful when GPU memory is insufficient. (A store_true argument: just add --grad-checkpointing to the script; currently requires Pytorch > 1.8.0.)
- mask-ratio: following the FLIP strategy, randomly mask the given proportion of image patches during finetuning to reduce memory usage and speed up training. Default 0.0, which disables the strategy.
- use-flash-attention: use FlashAttention, which significantly speeds up Chinese-CLIP finetuning and reduces memory usage without affecting the results. (A store_true argument: after setting up the environment, add --use-flash-attention to the script; see flash_attention.md for details.)
- accum-freq: gradient accumulation frequency, default 1. When set to an integer greater than 1, gradient accumulation for contrastive learning is enabled to simulate a larger batch size. With a per-GPU batch size of m, the total batch size is accum_freq * m * number of GPUs.
- gather-with-grad: whether to gather features with full gradients during distributed training, off by default.
- name: name of the output directory. The hyperparameter log, training log, and output checkpoints are stored under ${DATAPATH}/experiments/${name}/.
- save-step-frequency and save-epoch-frequency: checkpoint saving interval in steps or epochs.
- report-training-batch-acc: whether the log reports the in-batch image-to-text & text-to-image training accuracy.
- resume: path of the checkpoint to load. The sample script sets this to the pretrained checkpoint path; it can also be set to one of your own finetuned checkpoints to resume training.
- reset-data-offset: whether to resume reading data from the previous breakpoint. If the batch size or the number of GPUs has changed, it is recommended to enable this option.
- reset-optimizer: whether to use the optimizer state.

After training, the log is automatically stored at ${DATAPATH}/experiments/${name}/out_${timestamp}.log. The training log format is as follows:
```
2022-12-11,20:40:34 | INFO | Rank 0 | Global Steps: 1/735 | Train Epoch: 1 [1024/250880 (0%)] | Loss: 2.371020 | Image2Text Acc: 49.90 | Text2Image Acc: 48.73 | Data Time: 1.039s | Batch Time: 3.625s | LR: 0.000000 | logit_scale: 4.605 | Global Batch Size: 1024
```

The validation log format is as follows:

```
2022-12-11,20:42:47 | INFO | Rank 0 | Validation Result (epoch 1 @ 150 steps) | Valid Loss: 0.502810 | Image2Text Acc: 84.95 | Text2Image Acc: 84.26 | logit_scale: 4.605 | Valid Batch Size: 128
```
Note: training convergence and stability are correlated with the total batch size. If you use a smaller total batch size than the default configuration (128 per GPU * 8 GPUs), we recommend using a smaller learning rate. Using more GPUs and a larger total batch size generally gives better results.
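As a quick sanity check when tuning these options, the effective (global) batch size follows the accum-freq formula given above; a small worked example with illustrative values:

```python
# Effective batch size = accum_freq * per-GPU batch size * total number of GPUs.
accum_freq = 1            # --accum-freq
per_gpu_batch_size = 128  # --batch-size
worker_cnt = 1            # WORKER_CNT (number of machines)
gpus_per_node = 8         # GPUS_PER_NODE
global_batch_size = accum_freq * per_gpu_batch_size * worker_cnt * gpus_per_node
print(global_batch_size)  # 1024, matching the "Global Batch Size" field in the training log above
```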
The procedure for feature extraction and image-text retrieval evaluation is as follows.

Currently the code supports image/text feature extraction on a single GPU; see the commands below. We also provide ONNX and TensorRT deployment to accelerate feature inference, see deployment.md for details.
```bash
cd Chinese-CLIP/
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip

split=valid # compute features for the valid or test split
resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt

python -u cn_clip/eval/extract_features.py \
    --extract-image-feats \
    --extract-text-feats \
    --image-data="${DATAPATH}/datasets/${dataset_name}/lmdb/${split}/imgs" \
    --text-data="${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl" \
    --img-batch-size=32 \
    --text-batch-size=32 \
    --context-length=52 \
    --resume=${resume} \
    --vision-model=ViT-B-16 \
    --text-model=RoBERTa-wwm-ext-base-chinese
```

By default, the extracted features are saved under the ${DATAPATH}/datasets/${dataset_name} directory. Image features are saved in ${split}_imgs.img_feat.jsonl; each line stores the features of one image as JSON, in the following format:
{"image_id": 1000002, "feature": [0.0198, ..., -0.017, 0.0248]}
Text features are saved in ${split}_texts.txt_feat.jsonl , with the format as follows:
{"text_id": 248816, "feature": [0.1314, ..., 0.0018, -0.0002]}
For small-scale academic retrieval datasets, we provide a simple KNN search implementation to compute top-k recall for text-to-image and image-to-text retrieval. (Tip: if you want to build a retrieval demo on top of this project, we recommend extracting image and text features with the Chinese-CLIP model and combining them with the open-source clip-retrieval framework to build the front-end service.)

For text-to-image retrieval (recalling relevant images for each text), run the following command:
```bash
cd Chinese-CLIP/
split=valid # compute predictions for the valid or test split
python -u cn_clip/eval/make_topk_predictions.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl"
```

The results are saved to the specified jsonl file. Each line contains the top-k image ids recalled for one text, in the following format:
{ "text_id" : 153915 , "image_ids" : [ 5791244 , 1009692167 , 7454547004 , 3564007203 , 38130571 , 2525270674 , 2195419145 , 2503091968 , 4966265765 , 3690431163 ]}For image-to-text search (image recall related text), similarly, run the following command:
For image-to-text retrieval (recalling relevant texts for each image), run the analogous command:

```bash
split=valid # compute predictions for the valid or test split
python -u cn_clip/eval/make_topk_predictions_tr.py \
    --image-feats="${DATAPATH}/datasets/${dataset_name}/${split}_imgs.img_feat.jsonl" \
    --text-feats="${DATAPATH}/datasets/${dataset_name}/${split}_texts.txt_feat.jsonl" \
    --top-k=10 \
    --eval-batch-size=32768 \
    --output="${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl"
```

Each line of the output contains the top-k text ids recalled for one image, in the following format:
{ "image_id" : 977856234 , "text_ids" : [ 156914 , 157914 , 158914 , 155914 , 156179 , 158907 , 157179 , 154179 , 154914 , 154723 ]}We provide the evaluation script to calculate the Recall@1/5/10 of the search task, and give the mean recall (the average of Recall@1/5/10). Run the following command to get the score:
For text-to-image retrieval, run:
```bash
split=valid # evaluate the valid or test split
python cn_clip/eval/evaluation.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_predictions.jsonl \
    output.json
cat output.json
```

For image-to-text retrieval, first run the following command to convert the ground-truth annotation jsonl from the text-to-image format to the image-to-text format:
```bash
python cn_clip/eval/transform_ir_annotation_to_tr.py \
    --input ${DATAPATH}/datasets/${dataset_name}/${split}_texts.jsonl
```

After the conversion completes, run:
```bash
split=valid # evaluate the valid or test split
python cn_clip/eval/evaluation_tr.py \
    ${DATAPATH}/datasets/${dataset_name}/${split}_texts.tr.jsonl \
    ${DATAPATH}/datasets/${dataset_name}/${split}_tr_predictions.jsonl \
    output.json
cat output.json
```

The printed result has the following format:
{ "success" : true , "score" : 85.67 , "scoreJson" : { "score" : 85.67 , "mean_recall" : 85.67 , "r1" : 71.2 , "r5" : 90.5 , "r10" : 95.3 }}Regarding the training and testing process of cross-modal retrieval, we take the MUGE search dataset (Multimodal E-commerce Graphics and Text Challenge) as an example, and also provides a Jupyter Notebook (download link) that includes all the above processes and can be run. Everyone is welcome to practice it.
This section describes how to use Chinese-CLIP for zero-shot image classification, taking the datasets of the ELEVATER benchmark as an example. ELEVATER is an evaluation suite composed of several well-known classification datasets (including CIFAR-10, CIFAR-100, MNIST, etc.) that measures a model's zero-shot performance on these datasets. For our experiments, we prepared Chinese prompts, Chinese category labels, and the original images for each dataset (see the data documentation for details) so that the Chinese-CLIP models can be evaluated. For more details about this benchmark, please follow the link. You can also follow the same procedure to prepare and evaluate your own Chinese classification dataset.
First, prepare the data in the following format. Since zero-shot image classification only requires inference, you only need to prepare the test set and the pretrained model weights, stored under the user-specified ${DATAPATH} with the following directory structure:
```
${DATAPATH}
├── pretrained_weights/
└── datasets/
    └── ${dataset_name}/
        ├── label_cn.txt
        └── test/
            ├── 000/                # label id; if there are more than 10 labels, left-pad with zeros to 3 digits to keep lexicographic order
            │   ├── image_0003.jpg  # image samples; no special naming requirements
            │   ├── image_0005.jpg
            │   └── ...
            ├── 001/
            │   ├── image_0001.jpg
            │   ├── image_0002.jpg
            │   └── ...
            └── 002/
                ├── image_0003.jpg
                ├── image_0005.jpg
                └── ...
            ...
```
Make sure the data in the test folder is split into subfolders by label id, and that the ids are in lexicographic order (when there are more than 10 labels, left-pad the ids with zeros, i.e. label.zfill(3), e.g. 001, 002, etc.). label_cn.txt contains the label names, one per line, for example:
```
手风琴
飞机
锚
...
```
The label id of each line is its line number minus 1: the label on the first line has id 0, the label on the second line has id 1, and so on. If the total number of labels is greater than 10, the ids are left-padded with zeros to 3 digits; for example, with 100 labels the label ids are 000-099. You need to create a folder for each label id and put the samples with that label into it. We take the CIFAR-100 dataset from ELEVATER as an example; please follow the link to download the preprocessed data. If you want to evaluate Chinese-CLIP on the other datasets included in ELEVATER, see our data documentation.
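To arrange your own classification test set into this layout, a minimal sketch (with hypothetical label names and image paths) might look like this:

```python
import os
import shutil

# Hypothetical labels and (image path, label id) samples.
labels = ["手风琴", "飞机", "锚"]
samples = [("raw_photos/accordion_01.jpg", 0), ("raw_photos/airplane_07.jpg", 1)]

dataset_dir = os.path.join("datasets", "my-dataset")
os.makedirs(dataset_dir, exist_ok=True)

# One label name per line; the label id is the line number minus 1.
with open(os.path.join(dataset_dir, "label_cn.txt"), "w", encoding="utf-8") as f:
    f.write("\n".join(labels) + "\n")

# As described above, left-pad label ids with zeros to 3 digits when there are more than 10 labels.
pad = 3 if len(labels) > 10 else 1
for path, label_id in samples:
    target = os.path.join(dataset_dir, "test", str(label_id).zfill(pad))
    os.makedirs(target, exist_ok=True)
    shutil.copy(path, target)
```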
We provide a prediction script, run_scripts/zeroshot_eval.sh. An example command is:

```bash
bash run_scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} ${dataset_name} \
    ${vision_model} ${text_model} \
    ${ckpt_path} ${index_file}
```

The meanings of the parameters are:
- 0 is the GPU id.
- DATAPATH: see the data preparation section above; fill in the actual path.
- dataset_name: see the data preparation section above; the directory name of the dataset to evaluate, e.g. cifar-100.
- vision_model: the vision backbone, chosen from ["ViT-B-32", "ViT-B-16", "ViT-L-14", "ViT-L-14-336", "RN50", "ViT-H-14"].
- text_model: chosen from ["RoBERTa-wwm-ext-base-chinese", "RoBERTa-wwm-ext-large-chinese", "RBT3-chinese"].
- ckpt_path: the full path of the pretrained checkpoint.
- index_file: optional, only needed when submitting to the official ELEVATER evaluation; see the data documentation.

For example, to evaluate the ViT-B/16 pretrained model on CIFAR-100, run (replace ${DATAPATH} with the actual path):
```bash
bash run_scripts/zeroshot_eval.sh 0 \
    ${DATAPATH} cifar-100 \
    ViT-B-16 RoBERTa-wwm-ext-base-chinese \
    ${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt
```

The command prints the top-1 accuracy:
```
Result:
zeroshot-top1: 0.6444
```
On CIFAR-100, the ViT-B/16 Chinese-CLIP model is expected to reach about 64.4% top-1 accuracy. For our zero-shot classification results with other model sizes and on other datasets, please see Results.md.
The program also saves a json file for submission to the official ELEVATER evaluation. Its content looks like this:

```
{"model_name": "CN-CLIP-ViT-B-16", "dataset_name": "cifar-100", "num_trainable_params": 0, "num_params": 188262913, "num_visual_params": 86192640, "num_backbone_params": 188262913, "n_shot": 0, "rnd_seeds": [123], "predictions": "prediction probability tensor [size: (1, 10000, 100)]"}
```

It contains model meta-information such as the model name model_name, the dataset name dataset_name, the total number of parameters num_params, and the number of vision-tower parameters num_visual_params, as well as the model output, i.e. the predicted probability tensor of size [1, number of samples, number of labels].
Based on our feature extraction API integrated into Huggingface transformers, we also provide an online demo (Hosted inference API) on the Huggingface Model Hub where you can try zero-shot image classification directly. See the demo links for each model size. You are welcome to try it!
If you find this project helpful, please give us a star and share it with people around you. Citations of the related work are also welcome; thank you for your support!
```
@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}
```