
Autoregressive Pre-training of Large Vision Encoders

This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning:

AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders [ BibTeX ]
Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby*
AIMv1: Scalable Pre-training of Large Autoregressive Image Models [ BibTeX ]
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M. Susskind, Armand Joulin.

*: Equal technical contribution

If you are looking for the original AIM model (AIMv1), please refer to its README here.

Overview of AIMv2

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train, and it scales effectively. Some AIMv2 highlights include:

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:

[ AIMv2 with 224px ]
[ AIMv2 with 336px ]
[ AIMv2 with 448px ]
[ AIMv2 with Native Resolution ]
[ AIMv2 distilled ViT-Large ] (recommended for multimodal applications)
[ Zero-shot Adapted AIMv2 ]

Installation

Please install PyTorch using the official installation instructions. Afterward, install the package as:

pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'

We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:

pip install mlx
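A minimal sketch for choosing the `backend` argument at runtime. It assumes MLX should only be used on macOS with an arm64 (Apple silicon) CPU and only when the `mlx` package is importable. `pick_backend` is a hypothetical helper, not part of the `aim` package; the strings it returns simply mirror the `backend=` values passed to `load_pretrained` in the examples.

```python
import importlib.util
import platform


def pick_backend(prefer_mlx: bool = True) -> str:
    """Return "mlx" on Apple silicon when MLX is installed, otherwise "torch".

    Hypothetical helper: the detection logic below is an assumption,
    not behavior provided by the aim package.
    """
    # Native Python on Apple silicon reports Darwin / arm64.
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    # find_spec returns None when the top-level package is not installed.
    mlx_installed = importlib.util.find_spec("mlx") is not None
    if prefer_mlx and on_apple_silicon and mlx_installed:
        return "mlx"
    return "torch"


if __name__ == "__main__":
    print(pick_backend())
```

Falling back to `"torch"` keeps the same script usable on machines without MLX, such as Linux GPU servers.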

Beispiele

Using PyTorch

from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)  # path to your image
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)  # match the checkpoint's 336px resolution

inp = transform(img).unsqueeze(0)  # add a batch dimension
features = model(inp)

Using MLX

from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)  # path to your image
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)  # add a batch dimension
inp = mx.array(inp.numpy())  # convert the PyTorch tensor to an MLX array
features = model(inp)

Using JAX

from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)  # path to your image
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)  # add a batch dimension
inp = jnp.array(inp.numpy())  # convert the PyTorch tensor to a JAX array
features = model.apply({"params": params}, inp)

Pre-trained Checkpoints

The pre-trained models can be accessed via the HuggingFace Hub as follows:

from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

AIMv2 with 224px

model_id	#params	IN-1k	HF Link	Backbone
aimv2-large-patch14-224	0.3B	86.6	link	link
aimv2-huge-patch14-224	0.6B	87.5	link	link
aimv2-1B-patch14-224	1.2B	88.1	link	link
aimv2-3B-patch14-224	2.7B	88.5	link	link

AIMv2 with 336px

model_id	#params	IN-1k	HF Link	Backbone
aimv2-large-patch14-336	0.3B	87.6	link	link
aimv2-huge-patch14-336	0.6B	88.2	link	link
aimv2-1B-patch14-336	1.2B	88.7	link	link
aimv2-3B-patch14-336	2.7B	89.2	link	link

AIMv2 with 448px

model_id	#params	IN-1k	HF Link	Backbone
aimv2-large-patch14-448	0.3B	87.9	link	link
aimv2-huge-patch14-448	0.6B	88.6	link	link
aimv2-1B-patch14-448	1.2B	89.0	link	link
aimv2-3B-patch14-448	2.7B	89.5	link	link
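As a worked example, assuming that `patch14` in each `model_id` denotes the 14-pixel patch size used elsewhere in this README and that inputs are square, the number of image patches N at pre-training resolution R with patch size P is:

```latex
N = \left(\frac{R}{P}\right)^{2}, \qquad P = 14:
\quad N_{224} = 16^{2} = 256, \quad N_{336} = 24^{2} = 576, \quad N_{448} = 32^{2} = 1024
```

Under this assumption, moving from 224px to 448px quadruples the number of patch tokens the encoder processes (not counting any extra tokens the architecture may add).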

AIMv2 with Native Resolution

We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and a 2D sinusoidal positional embedding is added to the linearly projected input patches. This checkpoint supports a number of patches in the range [112, 4096].

model_id	#params	IN-1k	HF Link	Backbone
aimv2-large-patch14-native	0.3B	87.3	link	link
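The following is a framework-free sketch of the two ideas in the paragraph above: counting patches for an arbitrary image size and building a 2D sinusoidal positional embedding. It is not the checkpoint's actual code. It assumes that remainder pixels not filling a full 14-pixel patch are cropped, and it uses the common MAE-style 2D sin-cos formulation (half of the channels encode the row index, half the column index), which may differ in detail from AIMv2. All function names are hypothetical.

```python
import math

PATCH_SIZE = 14                        # patch_size stated in the README
MIN_PATCHES, MAX_PATCHES = 112, 4096   # supported range stated in the README


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Patch rows and columns, assuming leftover pixels are cropped."""
    return height // patch_size, width // patch_size


def is_supported(height: int, width: int) -> bool:
    """True if the patch count falls inside the supported [112, 4096] range."""
    rows, cols = patch_grid(height, width)
    return MIN_PATCHES <= rows * cols <= MAX_PATCHES


def sincos_1d(dim: int, positions: list[float]) -> list[list[float]]:
    """1D sin-cos embedding: dim/2 sine channels followed by dim/2 cosine channels."""
    half = dim // 2
    omegas = [1.0 / 10000 ** (i / half) for i in range(half)]
    return [
        [math.sin(p * w) for w in omegas] + [math.cos(p * w) for w in omegas]
        for p in positions
    ]


def sincos_2d(dim: int, rows: int, cols: int) -> list[list[float]]:
    """2D embedding: one dim-sized vector per patch, in row-major order."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    row_emb = sincos_1d(dim // 2, [float(r) for r in range(rows)])
    col_emb = sincos_1d(dim // 2, [float(c) for c in range(cols)])
    # Concatenate the row-index half and the column-index half for each patch.
    return [row_emb[r] + col_emb[c] for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    rows, cols = patch_grid(336, 448)           # a non-square input
    print(rows, cols, is_supported(336, 448))
    pos = sincos_2d(dim=8, rows=rows, cols=cols)
    print(len(pos), len(pos[0]))
```

Under the cropping assumption, a square input needs a side length between 154px (11 x 11 = 121 patches) and 896px (64 x 64 = 4096 patches); 140px yields only 10 x 10 = 100 patches. Because the positional embedding is computed from the grid shape rather than learned per position, it can be generated for any rows x cols layout.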

AIMv2 Distilled ViT-Large

We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers strong performance on multimodal understanding benchmarks, outperforming the non-distilled AIMv2-L on eight of the nine benchmarks below (all except TextVQA).

Model	VQAv2	GQA	OKVQA	TextVQA	DocVQA	InfoVQA	ChartQA	SciQA	MMEp
AIMv2-L	80.2	72.6	60.9	53.9	26.8	22.4	20.3	74.5	1457
AIMv2-L distilled	81.1	73.0	61.4	53.5	29.2	23.3	24.0	76.3	1627

model_id	#params	Res.	HF Link	Backbone
aimv2-large-patch14-224-distilled	0.3B	224px	link	link
aimv2-large-patch14-336-distilled	0.3B	336px	link	link

Zero-shot Adapted AIMv2

We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.

Model	#params	Zero-shot IN-1k	Backbone
AIMv2-L	0.3B	77.0	link
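Once an image encoder and a text encoder share an embedding space, zero-shot recognition typically reduces to picking the class prompt whose text embedding is most similar to the image embedding. A minimal, framework-free sketch of that scoring step follows. The embeddings are toy placeholders standing in for encoder outputs, and `zero_shot_predict` is a hypothetical helper, not part of the AIMv2 API.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def zero_shot_predict(image_emb: list[float], class_text_embs: dict[str, list[float]]) -> str:
    """Return the class whose text embedding has the highest cosine similarity to the image."""
    img = l2_normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for name, txt in class_text_embs.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2D embeddings; real encoder outputs are much higher-dimensional.
    classes = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
    print(zero_shot_predict([0.9, 0.2], classes))
```

In practice, each class name would first be turned into a text prompt and encoded once, so classifying a new image costs one image-encoder pass plus a dot product per class.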

Citation

If you find our work useful, please consider citing us as:

AIMv2 BibTeX

@misc{fini2024multimodal,
    title = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
    author = {Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
    year = {2024},
    eprint = {2411.14402},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

AIMv1 BibTeX

@InProceedings{pmlr-v235-el-nouby24a,
  title     = {Scalable Pre-training of Large Autoregressive Image Models},
  author    = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12371--12384},
  year      = {2024},
}