Paper | Blog posts: 1, 2 | Demo
Welcome to our open-source implementation of DeepMind's Flamingo!
In this repository, we provide a PyTorch implementation for training and evaluating OpenFlamingo models. If you have any questions, please feel free to open an issue. We also welcome contributions!
To install the package in an existing environment, run
pip install open-flamingo
or to create a conda environment for running OpenFlamingo, run
conda env create -f environment.yml
To install training or eval dependencies, run one of the first two commands below. To install everything, run the third command.
pip install open-flamingo[training]
pip install open-flamingo[eval]
pip install open-flamingo[all]
There are three requirements.txt files:
- requirements.txt
- requirements-training.txt
- requirements-eval.txt

Depending on your use case, you can install any of these with pip install -r <requirements-file.txt>. The base file contains only the dependencies needed for running the model.
We use pre-commit hooks to align formatting with the checks in the repository. To set them up, install pre-commit with pip (or with Homebrew on macOS), check the installed version, and then run pre-commit install at the root of this repository:
pip install pre-commit
brew install pre-commit
pre-commit --version
pre-commit install
Then, every time we run git commit, the checks are run. If files are reformatted by the hooks, run git add for the files you changed and git commit again.
OpenFlamingo is a multimodal model that can be used for a variety of tasks. It is trained on large multimodal datasets (e.g. Multimodal C4) and can be used to generate text conditioned on interleaved image/text sequences. For example, OpenFlamingo can be used to generate a caption for an image, or to generate a question given an image and a text passage. The benefit of this approach is that we are able to rapidly adapt to new tasks using in-context learning.
OpenFlamingo combines a pretrained vision encoder and a language model using cross-attention layers. The model architecture is shown below.
Credit: Flamingo
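To make the idea concrete, here is a toy sketch of a gated cross-attention block of the kind Flamingo inserts into the language model. This is illustrative only and is not the implementation used in this repository; the class name and sizes are made up for the example.

```python
import torch
import torch.nn as nn

class ToyGatedCrossAttention(nn.Module):
    """Toy illustration: text hidden states attend to visual features, and a
    tanh gate initialized at zero lets the pretrained LM start out unchanged."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, text_hidden, visual_features):
        # text_hidden: (batch, text_len, dim); visual_features: (batch, num_visual, dim)
        attended, _ = self.attn(query=text_hidden, key=visual_features, value=visual_features)
        return text_hidden + torch.tanh(self.gate) * attended
```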
We support pretrained vision encoders from the OpenCLIP package, which includes OpenAI's pretrained models. We also support pretrained language models from the transformers package, such as MPT, RedPajama, LLaMA, OPT, GPT-Neo, GPT-J, and Pythia models.
from open_flamingo import create_model_and_transforms
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR",  # Defaults to ~/.cache
)

So far, we have trained the following OpenFlamingo models.
| # params | Language model | Vision encoder | Xattn interval* | COCO 4-shot CIDEr | VQAv2 4-shot accuracy | Weights |
|---|---|---|---|---|---|---|
| 3B | anas-awadalla/mpt-1b-redpajama-200b | OpenAI CLIP ViT-L/14 | 1 | 77.3 | 45.8 | Link |
| 3B | anas-awadalla/mpt-1b-redpajama-200b-dolly | OpenAI CLIP ViT-L/14 | 1 | 82.7 | 45.7 | Link |
| 4B | togethercomputer/RedPajama-INCITE-Base-3B-v1 | OpenAI CLIP ViT-L/14 | 2 | 81.8 | 49.0 | Link |
| 4B | togethercomputer/RedPajama-INCITE-Instruct-3B-v1 | OpenAI CLIP ViT-L/14 | 2 | 85.8 | 49.0 | Link |
| 9B | anas-awadalla/mpt-7b | OpenAI CLIP ViT-L/14 | 4 | 89.0 | 54.8 | Link |
* Xattn interval refers to the --cross_attn_every_n_layers argument.
Note: as part of our v2 release, we have deprecated our previous LLaMA-based checkpoints. However, you can continue to use our older checkpoints with the new codebase.
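To map the table onto the initialization arguments, the 9B configuration would be created roughly as follows. This is a sketch based on the table above; the exact Hugging Face paths should be taken from the released weights rather than from this example.

```python
from open_flamingo import create_model_and_transforms

# Sketch: the 9B row of the table, i.e. an MPT-7B language model with a
# cross-attention layer every 4 LM layers. Paths are assumptions from the table.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-7b",
    tokenizer_path="anas-awadalla/mpt-7b",
    cross_attn_every_n_layers=4,  # "Xattn interval" column
)
```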
To instantiate an OpenFlamingo model with one of our released weights, initialize the model as described above and use the following code.
# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch
checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.
from PIL import Image
import requests
import torch
"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)
demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True,
    ).raw
)
query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
        stream=True,
    ).raw
)
"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
batch_size x num_media x num_frames x channels x height x width.
In this case batch_size = 1, num_media = 3, num_frames = 1,
channels = 3, height = 224, width = 224.
"""
vision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]
vision_x = torch.cat(vision_x, dim=0)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)
"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
We also expect an <|endofchunk|> special token to indicate the end of the text
portion associated with an image.
"""
tokenizer.padding_side = "left"  # For generation padding tokens should be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)
"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
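# Note (added for clarity, not part of the original example): generate() returns
# the full token sequence, prompt included. Passing skip_special_tokens=True to
# tokenizer.decode() below would strip the <image> and <|endofchunk|> markers.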
print ( "Generated text: " , tokenizer . decode ( generated_text [ 0 ]))我们在open_flamingo/train中提供培训脚本。我们在open_flamingo/scripts/run_train.py以及以下示例命令中提供了一个示例slurm脚本:
torchrun --nnodes=1 --nproc_per_node=4 open_flamingo/train/train.py \
  --lm_path anas-awadalla/mpt-1b-redpajama-200b \
  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
  --cross_attn_every_n_layers 1 \
  --dataset_resampled \
  --batch_size_mmc4 32 \
  --batch_size_laion 64 \
  --train_num_samples_mmc4 125000 \
  --train_num_samples_laion 250000 \
  --loss_multiplier_laion 0.2 \
  --workers=4 \
  --run_name OpenFlamingo-3B-vitl-mpt1b \
  --num_epochs 480 \
  --warmup_steps 1875 \
  --mmc4_textsim_threshold 0.24 \
  --laion_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --mmc4_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --report_to_wandb
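The shard arguments use brace-expansion patterns in the style of webdataset. As a rough illustration of what such a pattern means, the sketch below expands one with the braceexpand package (a webdataset dependency); this is only to show the naming convention, not code taken from the training scripts.

```python
from braceexpand import braceexpand

# "/path/to/shards/shard-{0000..0999}.tar" names 1000 tar shards.
shards = list(braceexpand("/path/to/shards/shard-{0000..0999}.tar"))
print(len(shards))             # 1000
print(shards[0], shards[-1])   # .../shard-0000.tar .../shard-0999.tar
```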
Note: the MPT-1B base and instruct modeling code does not accept the labels kwarg or compute cross-entropy loss directly within forward(), as our codebase expects. We suggest using modified versions of the MPT-1B models found here and here.
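For illustration only, the kind of change involved looks roughly like the sketch below. LMWithLabels is a hypothetical wrapper, not code from this repository or from the linked MPT-1B variants; it simply shows a labels kwarg being accepted and a next-token cross-entropy loss being computed inside forward().

```python
import torch.nn as nn
import torch.nn.functional as F

class LMWithLabels(nn.Module):
    """Hypothetical wrapper around a causal LM whose forward() only returns logits."""

    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        logits = self.lm(input_ids=input_ids, attention_mask=attention_mask, **kwargs).logits
        if labels is None:
            return logits
        # Shift so that position t predicts token t+1; -100 marks ignored positions.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )
        return loss, logits
```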
For more details, see our training README.
An example evaluation script is at open_flamingo/scripts/run_eval.sh. For more details, see our evaluation README.
To run evaluations on OK-VQA you will need to run the following command:
import nltk
nltk.download('wordnet')
OpenFlamingo is developed by:
Anas Awadalla*, Irena Gao*, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt.
The team is primarily from the University of Washington, Stanford, AI2, UCSB, and Google.
This code is based on Lucidrains' flamingo implementation and David Hansmair's flamingo-mini repository. Thank you for making your code public! We also thank the OpenCLIP team, as we use their data loading code and take inspiration from their library design.
We would also like to thank Jean-Baptiste Alayrac and Antoine Miech for their advice, Rohan Taori, Nicholas Schiefer, Deep Ganguli, Thomas Liao, Tatsunori Hashimoto, and Nicholas Carlini for their help with assessing the safety risks of our release, and Stability AI for providing us with compute resources to train these models.
If you found this repository useful, please consider citing:
@article{awadalla2023openflamingo,
title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},
author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},
journal={arXiv preprint arXiv:2308.01390},
year={2023}
}
@software{anas_awadalla_2023_7733589,
author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
title = {OpenFlamingo},
month = mar,
year = 2023,
publisher = {Zenodo},
version = {v0.1.1},
doi = {10.5281/zenodo.7733589},
url = {https://doi.org/10.5281/zenodo.7733589}
}
@article{Alayrac2022FlamingoAV,
title={Flamingo: a Visual Language Model for Few-Shot Learning},
author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
journal={ArXiv},
year={2022},
volume={abs/2204.14198}
}