ScreenAI
1.0.0

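Implementation of the ScreenAI model from the paper "A Vision-Language Model for UI and Infographics Understanding". The flow is: img + text -> patch sizes -> vit -> embed + concat -> attn + ffn -> cross attn + ffn + ffn + self attn -> to out. Paper link: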
pip3 install screenai
import torch
from screenai.main import ScreenAI

# Create a tensor for the image: (batch, channels, height, width)
image = torch.rand(1, 3, 224, 224)

# Create a tensor for the text: (batch, sequence length, dim)
text = torch.randn(1, 1, 512)

# Create an instance of the ScreenAI model with specified parameters
model = ScreenAI(
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_ff_mult=4,
)

# Perform forward pass of the model with the given text and image tensors
out = model(text, image)

# Print the output tensor
print(out)
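
The "patch sizes" step of the flow can be illustrated with a little arithmetic on the settings used above (image_size=224, patch_size=16). The following is a minimal sketch, assuming the image is cut into square, non-overlapping patches, as in a standard ViT. The helper `patch_grid` is hypothetical and is not part of the screenai package; how the library itself patches the image may differ.

```python
# Hypothetical helper for illustration only; not part of the screenai package.
def patch_grid(image_size: int, patch_size: int, channels: int = 3):
    """Return (num_patches, values_per_patch) for a square image
    cut into square, non-overlapping patches (standard ViT assumption)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size                     # patches along one edge
    num_patches = per_side * per_side                       # total patches in the grid
    values_per_patch = channels * patch_size * patch_size   # flattened patch length
    return num_patches, values_per_patch


num_patches, values_per_patch = patch_grid(image_size=224, patch_size=16)
print(num_patches, values_per_patch)  # 196 768
```

Under these assumptions, a 224 x 224 RGB image becomes 196 patches of 768 values each before any embedding is applied.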
MIT License
@misc{baechler2024screenai,
    title={ScreenAI: A Vision-Language Model for UI and Infographics Understanding},
    author={Gilles Baechler and Srinivas Sunkara and Maria Wang and Fedir Zubach and Hassan Mansoor and Vincent Etter and Victor Cărbune and Jason Lin and Jindong Chen and Abhanshu Sharma},
    year={2024},
    eprint={2402.04615},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}