open genie

Open Genie: Generative Interactive Environments in PyTorch

This repo contains an unofficial implementation of Genie: Generative Interactive Environments, as introduced by Bruce et al. (2024) at Google DeepMind.

The goal of the model is to introduce "[...] the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos".


Usage

We provide a LightningCLI interface to easily train the different components of the Genie model. In particular, to train the VideoTokenizer, run:

python tokenizer.py train -config <path_to_conf_file>

To train both the LatentAction and Dynamics models (leveraging a fully trained VideoTokenizer), run:

python genie.py train -config <path_to_conf_file>

We provide example configuration files in the config folder.

The following sections provide example code for the core building blocks that together form the overall Genie module.

VideoTokenizer

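Genie relies on a VideoTokenizer, which digests input videos and, via its encode and quantize abilities, converts them into discrete tokens. These tokens are what the Dynamics module uses to manipulate the latent video space. The VideoTokenizer module accepts several parameters for extensive customization. Here is example code for typical use: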

import torch

from genie import VideoTokenizer

# Pre-assembled description of MagViT2
# encoder & decoder architecture
from genie import MAGVIT2_ENC_DESC
from genie import MAGVIT2_DEC_DESC

tokenizer = VideoTokenizer(
  # We can pass an arbitrary description of the
  # encoder architecture, see genie.tokenizer.get_module
  # to see which modules are supported
  enc_desc = (
    ('causal', { # A CausalConv3d layer
        'in_channels' : 3,
        'out_channels' : 64,
        'kernel_size' : 3,
    }),
    ('residual', { # Residual Block
        'in_channels' : 64,
        'kernel_size' : 3,
        'downsample' : (1, 2), # Optional down-scaling (time, space)
        'use_causal' : True, # Using causal padding
        'use_blur' : True, # Using blur-pooling
    }),
    ('residual', {
        'in_channels' : 64,
        'out_channels' : 128, # Output channels can be different
    }),
    ('residual', {
        'n_rep' : 2, # We can repeat this block N-times
        'in_channels' : 128,
    }),
    ('residual', {
        'in_channels' : 128,
        'out_channels' : 256, # We can mix different output channels...
        'kernel_size' : 3,
        'downsample' : 2, # ...with down-sampling (here time=space=2)
        'use_causal' : True,
    }),
    ('proj_out', { # Output projection to the quantization module
        'in_channels' : 256,
        'out_channels' : 18,
        'num_groups' : 8,
        'kernel_size' : 3,
    }),
  ),
  # Save time, use a pre-made configuration!
  dec_desc = MAGVIT2_DEC_DESC,

  # Description of GAN discriminator
  disc_kwargs = dict(
      # Discriminator parameters
      inp_size = (64, 64), # Size of input frames
      model_dim = 64,
      dim_mults = (1, 2, 4),    # Channel multipliers
      down_step = (None, 2, 2), # Down-sampling steps
      inp_channels = 3,
      kernel_size = 3,
      num_groups = 8,
      act_fn = 'leaky', # Use LeakyReLU as activation function
      use_blur = True,  # Use BlurPooling for down-sampling
      use_attn = True,  # Discriminator can have spatial attention
      num_heads = 4,    # Number of (spatial) attention heads
      dim_head = 32,    # Dimension of each spatial attention head
  ),

  # Keywords for the LFQ module
  d_codebook = 18, # Codebook dimension, should match encoder output channels
  n_codebook = 1, # Support for multiple codebooks
  lfq_bias = True,
  lfq_frac_sample = 1.,
  lfq_commit_weight = 0.25,
  lfq_entropy_weight = 0.1,
  lfq_diversity_weight = 1.,
  # Keywords for the different losses
  perceptual_model = 'vgg16', # We pick VGG-16 for perceptual loss
  # Which layers should we record perceptual features from
  perc_feat_layers = ('features.6', 'features.13', 'features.18', 'features.25'),
  gan_discriminate = 'frames', # GAN discriminator looks at individual frames
  gan_frames_per_batch = 4,  # How many frames to extract from each video to use for GAN
  gan_loss_weight = 1.,
  perc_loss_weight = 1.,
  quant_loss_weight = 1.,
)

batch_size = 4
num_channels = 3
num_frames = 16
img_h, img_w = 64, 64

# Example video tensor
mock_video = torch.randn(
  batch_size,
  num_channels,
  num_frames,
  img_h,
  img_w
)

# Tokenize input video
tokens, idxs = tokenizer.tokenize(mock_video)

# Tokenized video has shape:
# (batch_size, d_codebook, num_frames // down_time, H // down_space, W // down_space)

# To decode the video from tokens use:
rec_video = tokenizer.decode(tokens)

# To train the tokenizer (repeat for many steps!)
loss, aux_losses = tokenizer(mock_video)
loss.backward()
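
To make the shape comment and the d_codebook setting above more concrete, here is a minimal pure-Python sketch. It is not the repo's implementation: token_grid_shape and lfq_quantize are hypothetical helpers, the shape arithmetic assumes plain floor division (causal padding in the actual encoder may change the exact frame count), and the quantizer follows the general lookup-free quantization idea from Yu et al. (2023), where each latent channel is snapped to -1 or +1 and the sign pattern is read as a binary token index. Under that reading, d_codebook = 18 corresponds to 2^18 = 262,144 possible tokens per codebook.

```python
from math import prod


def token_grid_shape(video_shape, d_codebook, downsamples):
    """Shape of the token tensor for a (B, C, T, H, W) video.

    `downsamples` lists one (time, space) factor per down-scaling block.
    Assumption: plain floor division, as in the shape comment above.
    """
    batch, _, frames, height, width = video_shape
    down_time = prod(t for t, _ in downsamples)
    down_space = prod(s for _, s in downsamples)
    return (batch, d_codebook, frames // down_time,
            height // down_space, width // down_space)


def lfq_quantize(latent):
    """Lookup-free quantization of one latent vector (a list of floats).

    Each channel is snapped to -1 or +1 by its sign. The positive
    channels, read as bits of a binary number, give the token index.
    """
    code = [1.0 if x > 0 else -1.0 for x in latent]
    index = sum((c > 0) << i for i, c in enumerate(code))
    return code, index


# The example encoder above has two down-scaling blocks:
# 'downsample': (1, 2) and 'downsample': 2 (time = space = 2)
shape = token_grid_shape((4, 3, 16, 64, 64), d_codebook=18,
                         downsamples=[(1, 2), (2, 2)])

# A toy 4-channel latent instead of the 18 channels used above
code, index = lfq_quantize([0.3, -1.2, 0.7, -0.1])
```

Because the codebook is implicit in the sign pattern, no embedding table lookup is needed, which is what allows such a large vocabulary at low cost.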

Latent Action Model

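Genie implements a LatentAction model whose sole task is to formalize a (discrete) codebook of latent actions. This codebook is small by design to encourage interpretable actions (such as MOVE_RIGHT). To train such a codebook, the LatentAction model is built as a VQ-VAE, where the encoder ingests the video (pixel) frames and produces (quantized) actions as latents. The decoder then ingests the previous frame history and the current action to predict the next frame. Both the encoder and the decoder are discarded at inference time, since the actions are then provided by the user.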

The LatentAction model follows a design similar to the VideoTokenizer, where the encoder/decoder architectures can be specified via a Blueprint. Here is example code highlighting the core components:

import torch

from genie import LatentAction
from genie import LATENT_ACT_ENC

model = LatentAction(
  # Use a pre-made configuration...
  enc_desc = LATENT_ACT_ENC,
  # ...Or specify a brand-new one
  dec_desc = (
    # Latent Action uses space-time transformer
    ('space-time_attn', {
        'n_rep' : 2,
        'n_embd' : 256,
        'n_head' : 4,
        'd_head' : 16,
        'has_ext' : True,
        # Decoder uses latent action as external
        # conditioning for decoding!
        'time_attn_kw' : {'key_dim' : 8},
    }),
    # But we can also down/up-sample to manage resources
    # NOTE: Encoder & Decoder should work nicely together
    #       so that down/up-samples cancel out
    ('spacetime_upsample', {
        'in_channels' : 256,
        'kernel_size' : 3,
        'time_factor' : 1,
        'space_factor' : 2,
    }),
    ('space-time_attn', {
        'n_rep' : 2,
        'n_embd' : 256,
        'n_head' : 4,
        'd_head' : 16,
        'has_ext' : True,
        'time_attn_kw' : {'key_dim' : 8},
    }),
  ),
  d_codebook = 8,        # Small codebook to incentivize interpretability
  inp_channels = 3,      # Input video channel
  inp_shape = (64, 64),  # Spatial frame dimensions
  n_embd = 256,          # Hidden model dimension
  # [...] Other kwargs for controlling LFQ module behavior
)

# Create mock input video
batch_size = 2
video_len = 16
frame_dim = 64, 64

video = torch.randn(batch_size, 3, video_len, *frame_dim)

# Encode the video to extract the latent actions
(actions, encoded), quant_loss = model.encode(video)

# Compute the reconstructed video and its loss
recon, loss, aux_losses = model(video)

# This should work!
assert recon.shape == (batch_size, 3, video_len, *frame_dim)

# Train the model
loss.backward()
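
The training signal described in this section (frame history plus current action predicts the next frame) can be illustrated with a small index-only sketch. transition_pairs is a hypothetical helper, not part of the repo, and the indexing convention is an assumption: for frames x_0 ... x_{T-1} there are T - 1 transitions, with action a_t explaining the change from x_t to x_{t+1}. The actual model may instead pad the sequence so that the number of actions equals the number of frames.

```python
def transition_pairs(num_frames):
    """List (history, action_index, target) index triples for one video.

    Assumed convention: action a_t links frame x_t to frame x_{t+1},
    so a video with T frames yields T - 1 training transitions.
    """
    pairs = []
    for t in range(num_frames - 1):
        history = list(range(t + 1))  # indices of frames x_0 ... x_t
        pairs.append((history, t, t + 1))
    return pairs


# A 4-frame clip yields three (history, action, target) triples
pairs = transition_pairs(4)
for history, action, target in pairs:
    print(f"frames {history} + action a_{action} -> frame x_{target}")
```

During training the encoder sees the target frame and can therefore infer the action. At inference time the user supplies the action instead, which is why the encoder and decoder are no longer needed.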

Dynamics Model

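The DynamicsModel is tasked with predicting the next video token based on the past video tokens and the latent action history. The architecture is based on the MaskGIT model from Chang et al. (2022). Here is example code highlighting the core components: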

import torch

from genie import DynamicsModel

blueprint = (
  # Describe a Space-Time Transformer
  ('space-time_attn', {
      'n_rep' : 4,     # Number of layers
      'n_embd' : 256,  # Hidden dimension
      'n_head' : 4,    # Number of attention heads
      'd_head' : 16,   # Dimension of each attention head
      'transpose' : False,
  }),
)

# Create the model
tok_codebook = 16 # Dimension of video tokenizer codebook
act_codebook = 4  # Dimension of latent action codebook
dynamics = DynamicsModel(
    desc = blueprint,
    tok_vocab = tok_codebook,
    act_vocab = act_codebook,
    embed_dim = 256,          # Hidden dimension of the model
)

batch_size = 2
num_frames = 16
img_size   = 32

# Create mock token and latent action inputs
mock_tokens = torch.randint(0, tok_codebook, (batch_size, num_frames, img_size, img_size))
mock_act_id = torch.randint(0, act_codebook, (batch_size, num_frames))

# Compute the reconstruction loss based on Bernoulli
# masking of input tokens
loss = dynamics.compute_loss(
    mock_tokens,
    mock_act_id,
)

# Generate the tokens of the next video frame
new_tokens = dynamics.generate(
    mock_tokens,
    mock_act_id,
    steps = 5, # Number of MaskGIT sampling steps
)

assert new_tokens.shape == (batch_size, num_frames + 1, img_size, img_size)
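
To give some intuition for the steps argument of generate, here is a minimal sketch of the cosine masking schedule used for iterative decoding in MaskGIT (Chang et al., 2022). It is an illustration only: cosine_mask_schedule is a hypothetical helper, and whether the repo uses this exact schedule or rounding is an assumption. The new frame starts fully masked, and after each step only the least confident predictions stay masked, with the masked count following cos(pi/2 * step / steps).

```python
import math


def cosine_mask_schedule(num_tokens, steps):
    """Number of tokens still masked after each decoding step.

    Follows the cosine schedule gamma(r) = cos(pi/2 * r) with
    r = step / steps. Counts are rounded down (an assumption of this
    sketch) so that the final step leaves no token masked.
    """
    schedule = []
    for step in range(1, steps + 1):
        ratio = math.cos(0.5 * math.pi * step / steps)
        schedule.append(max(0, math.floor(ratio * num_tokens)))
    return schedule


# One 32x32 token frame decoded in 5 steps, as in the example above
schedule = cosine_mask_schedule(num_tokens=32 * 32, steps=5)
for step, masked in enumerate(schedule, start=1):
    print(f"after step {step}: {masked} tokens still masked")
```

The schedule unmasks few tokens early, when predictions are least reliable, and many tokens late, when most of the frame is already fixed. More steps generally trade speed for sample quality.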

Roadmap

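Implement the video tokenizer. Use the MagViT-2 tokenizer as described in Yu et al. (2023).
Implement the Latent Action Model, a vector-quantized ST-Transformer. Predict game actions from past video frames.
Implement the Dynamics Model, which takes past frames and actions and produces the new video frame.
Add a functioning training script (Lightning).
Show results.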

Requirements

The code was tested with Python 3.11+ and requires torch 2.0+ (because of the use of fast flash-attention). To install the required dependencies, simply run pip install -r requirements.txt.
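
As a quick sanity check of these requirements, a short script along the following lines can be run before training. It is only a sketch: meets_minimum is a hypothetical helper, and checking for torch.nn.functional.scaled_dot_product_attention (added in torch 2.0) is an assumption about how the fast attention path is reached, not something this README states.

```python
import sys


def meets_minimum(version, minimum):
    """Compare a version string such as '2.1.0+cu118' with a tuple."""
    release = version.split("+")[0]  # drop the local build tag
    parts = tuple(int(p) for p in release.split(".")[:len(minimum)])
    return parts >= minimum


python_ok = sys.version_info[:2] >= (3, 11)

try:
    import torch
    torch_ok = meets_minimum(torch.__version__, (2, 0))
    # scaled_dot_product_attention was introduced in torch 2.0
    has_sdpa = hasattr(torch.nn.functional, "scaled_dot_product_attention")
except ImportError:
    torch_ok = has_sdpa = False

print(f"python >= 3.11: {python_ok}")
print(f"torch >= 2.0: {torch_ok} (fused attention available: {has_sdpa})")
```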

Citations

This repo builds upon the beautiful MagViT implementation by lucidrains and the MaskGIT implementation from valeoai.

@article{bruce2024genie,
  title = {Genie: Generative Interactive Environments},
  author = {Bruce, Jake and Dennis, Michael and Edwards, Ashley and Parker-Holder, Jack and Shi, Yuge and Hughes, Edward and Lai, Matthew and Mavalankar, Aditi and Steigerwald, Richie and Apps, Chris and others},
  journal = {arXiv preprint arXiv:2402.15391},
  year = {2024}
}

@article{yu2023language,
  title = {Language Model Beats Diffusion--Tokenizer is Key to Visual Generation},
  author = {Yu, Lijun and Lezama, Jos{\'e} and Gundavarapu, Nitesh B and Versari, Luca and Sohn, Kihyuk and Minnen, David and Cheng, Yong and Gupta, Agrim and Gu, Xiuye and Hauptmann, Alexander G and others},
  journal = {arXiv preprint arXiv:2310.05737},
  year = {2023}
}

@inproceedings{chang2022maskgit,
  title = {Maskgit: Masked generative image transformer},
  author = {Chang, Huiwen and Zhang, Han and Jiang, Lu and Liu, Ce and Freeman, William T},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {11315--11325},
  year = {2022}
}