ml-aim 1.0.0
This repository is the entrypoint for AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning:
- Multimodal Autoregressive Pre-training of Large Vision Encoders [BibTeX]
- Scalable Pre-training of Large Autoregressive Image Models [BibTeX]

*: equal technical contribution
If you are looking for the original AIM models (AIMv1), please refer to the README here.
We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and to scale effectively. We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:
- AIMv2 with 224px
- AIMv2 with 336px
- AIMv2 with 448px
- AIMv2 with Native Resolution
- AIMv2 distilled ViT-Large (recommended for multimodal applications)
- Zero-shot Adapted AIMv2

Please install PyTorch using the official installation instructions. Afterwards, install the package as:
```
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```
We also offer an MLX backend for research and experimentation on Apple silicon. To enable MLX support, simply run:
```
pip install mlx
```
PyTorch:

```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
features = model(inp)
```

MLX:

```python
from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)
```

JAX:

```python
from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)
```

Pre-trained models can also be accessed via the HuggingFace Hub:
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```

| model_id | #params | IN-1K | HF link | backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗 link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗 link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗 link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗 link | link |
| model_id | #params | IN-1K | HF link | backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗 link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗 link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗 link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗 link | link |
| model_id | #params | IN-1K | HF link | backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗 link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗 link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗 link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗 link | link |
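As a sketch of how the HuggingFace outputs above might be consumed downstream, one can mean-pool the patch tokens into a single image-level embedding. The `last_hidden_state` shape and the pooling step are assumptions based on the standard `transformers` output convention, not something this README specifies; random data stands in for real model outputs:

```python
import numpy as np

# Hypothetical stand-in for `outputs.last_hidden_state` from the
# HuggingFace example above: (batch, num_patches, hidden_dim).
# A 224px input with patch_size=14 yields (224 // 14) ** 2 = 256 patches.
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((1, 256, 1024))

# Mean-pool patch tokens into one image-level embedding (common for ViTs).
image_embedding = last_hidden_state.mean(axis=1)  # shape: (1, 1024)

# L2-normalize, as is typical before similarity-based retrieval.
image_embedding /= np.linalg.norm(image_embedding, axis=-1, keepdims=True)
```

Other pooling schemes (e.g. attention pooling) are equally valid; mean pooling is simply the most common baseline.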
We also provide an AIMv2-L checkpoint designed to handle a wide range of image resolutions and aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and 2D sinusoidal positional embeddings are added to the linearly projected input patches. This checkpoint supports numbers of patches in the range [112, 4096].
| model_id | #params | IN-1K | HF link | backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗 link | link |
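To illustrate the [112, 4096] patch budget above, a small sketch of the patch-count arithmetic (the helper function is ours for illustration, not part of the package):

```python
def num_patches(height: int, width: int, patch_size: int = 14) -> int:
    """Number of patches an image yields after patchification,
    assuming it is cropped/resized to multiples of patch_size."""
    return (height // patch_size) * (width // patch_size)

# A square 336px image gives 24 * 24 = 576 patches, within [112, 4096].
assert num_patches(336, 336) == 576

# A wide 224x896 image gives 16 * 64 = 1024 patches, also supported.
assert num_patches(224, 896) == 1024

# The lower bound of 112 patches corresponds e.g. to a 98x224 crop (7 * 16).
assert num_patches(98, 224) == 112
```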
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers excellent performance on multimodal understanding benchmarks.
| model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MME-P |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |
| model_id | #params | res. | HF link | backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗 link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗 link | link |
After LiT tuning, we provide the AIMv2-L vision and text encoders, enabling zero-shot recognition.
| model | #params | IN-1K zero-shot | backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
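The zero-shot setup above can be sketched as CLIP-style classification: embed the image with the vision encoder, embed each class prompt with the text encoder, and pick the class with the highest cosine similarity. The toy embeddings below stand in for the real AIMv2-L encoder outputs; the prompt texts and dimensionality are illustrative only:

```python
import numpy as np

# Toy stand-ins for encoder outputs; in practice these come from the
# AIMv2-L vision and text encoders released above.
image_embedding = np.array([0.9, 0.1, 0.0])        # one image
text_embeddings = np.array([
    [1.0, 0.0, 0.0],                               # "a photo of a cat"
    [0.0, 1.0, 0.0],                               # "a photo of a dog"
    [0.0, 0.0, 1.0],                               # "a photo of a car"
])

def l2norm(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity = dot product of L2-normalized embeddings.
scores = l2norm(text_embeddings) @ l2norm(image_embedding)
predicted = int(np.argmax(scores))  # the "cat" prompt (index 0) wins
```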
If you find our work useful, please consider citing us as:
```bibtex
@misc{fini2024multimodal,
  title         = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author        = {Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
  year          = {2024},
  eprint        = {2411.14402},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
```

```bibtex
@InProceedings{pmlr-v235-el-nouby24a,
  title     = {Scalable Pre-training of Large Autoregressive Image Models},
  author    = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12371--12384},
  year      = {2024},
}
```

Please review the repository LICENSE before using the provided code and model weights.