ml-aim
1.0.0
This repository is the entry point for AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning:
Multimodal Autoregressive Pre-training of Large Vision Encoders [BibTeX]
Scalable Pre-training of Large Autoregressive Image Models [BibTeX]
*: equal technical contribution
If you are looking for the original AIM models (AIMv1), please refer to the README here.
We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and to scale effectively. Some AIMv2 highlights include:
We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:
- AIMv2 with 224px
- AIMv2 with 336px
- AIMv2 with 448px
- AIMv2 with Native Resolution
- AIMv2 distilled ViT-Large (recommended for multimodal applications)
- Zero-shot Adapted AIMv2

Please install PyTorch using the official installation instructions. Afterward, install the package as:
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:
pip install mlx
PyTorch:

from PIL import Image
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)
features = model(inp)

MLX:

from PIL import Image
import mlx.core as mx
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)

JAX:

from PIL import Image
import jax.numpy as jnp
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)
inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)

Pre-trained models can also be accessed via the HuggingFace Hub:
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

AIMv2 at 224px:

| model_id | #params | IN-1K | HF link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | link | link |
AIMv2 at 336px:

| model_id | #params | IN-1K | HF link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | link | link |
AIMv2 at 448px:

| model_id | #params | IN-1K | HF link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | link | link |
We also provide an AIMv2-L checkpoint adapted to process images of varying resolutions and aspect ratios. Regardless of aspect ratio, the image is patchified (patch_size=14), and 2D sincos positional embeddings are added to the linearly projected input patches. This checkpoint supports numbers of patches in the range [112, 4096].
| model_id | #params | IN-1K | HF link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | link | link |
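To make the native-resolution constraint above concrete, here is a minimal sketch (not the repository's code; the floor-division patchification is an assumption based on the description) of how many patches an image yields at patch_size=14 and whether it falls in the supported range:

```python
PATCH_SIZE = 14
MIN_PATCHES, MAX_PATCHES = 112, 4096  # supported range for the native checkpoint


def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches after patchifying the image."""
    return (height // patch_size) * (width // patch_size)


def is_supported(height: int, width: int) -> bool:
    """Whether an image of this size yields a patch count in [112, 4096]."""
    n = num_patches(height, width)
    return MIN_PATCHES <= n <= MAX_PATCHES


# A 336x448 image gives (336 // 14) * (448 // 14) = 24 * 32 = 768 patches.
print(num_patches(336, 448))   # 768
print(is_supported(336, 448))  # True
```

For instance, a very small image (below roughly 112 patches) or an extremely large one (above 4096 patches) would fall outside the supported range.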
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers remarkable performance on multimodal understanding benchmarks.
| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |
| model_id | #params | res. | HF link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | link | link |
We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.
| Model | #params | zero-shot IN-1K | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
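For intuition, zero-shot recognition with paired vision and text encoders reduces to a cosine-similarity search in a shared embedding space. The sketch below illustrates only that mechanic with random stand-in vectors; the embedding width and variable names are assumptions for illustration, not the repository's API:

```python
import numpy as np

# Random vectors stand in for real AIMv2 vision/text encoder outputs.
rng = np.random.default_rng(0)
dim = 768  # assumed embedding width, for illustration only

class_names = ["cat", "dog", "car"]
text_embeds = rng.normal(size=(len(class_names), dim))        # one per class prompt
image_embed = text_embeds[1] + 0.1 * rng.normal(size=dim)     # close to "dog"


def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


# Cosine similarity between the image and every class prompt; best match wins.
sims = l2_normalize(text_embeds) @ l2_normalize(image_embed)
predicted = class_names[int(np.argmax(sims))]
print(predicted)  # "dog"
```

With real encoders, `text_embeds` would come from embedding prompts such as "a photo of a {class}" and `image_embed` from the vision tower; the argmax step is unchanged.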
If you find our work useful, please consider citing us as:
@misc{fini2024multimodal,
  title         = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author        = {Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
  year          = {2024},
  eprint        = {2411.14402},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

@InProceedings{pmlr-v235-el-nouby24a,
  title     = {Scalable Pre-training of Large Autoregressive Image Models},
  author    = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel {\'A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12371--12384},
  year      = {2024},
}

Please review the repository LICENSE before using the provided code and models.