Descargar vector quantize pytorch - vector quantize pytorch Code Source Code Descargar

Cuantización vectorial - Pytorch

Una biblioteca de cuantificación vectorial transcrita originalmente a partir de la implementación de TensorFlow de DeepMind, que se realiza convenientemente en un paquete. Utiliza promedios móvil exponenciales para actualizar el diccionario.

VQ ha sido utilizado con éxito por DeepMind y OpenAI para una generación de imágenes de alta calidad (VQ-VAE-2) y música (Jukebox).

Instalar

$ pip install vector-quantize-pytorch

Uso

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_size = 512 ,     # codebook size
    decay = 0.8 ,             # the exponential moving average decay, lower means the dictionary will change faster
    commitment_weight = 1.   # the weight on the commitment loss
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = vq ( x ) # (1, 1024, 256), (1, 1024), (1)

VQ residual

Este documento propone utilizar múltiples cuantizadores de vectores para cuantificar recursivamente los residuos de la forma de onda. Puede usar esto con la clase ResidualVQ y un parámetro de inicialización adicional.

 import torch
from vector_quantize_pytorch import ResidualVQ

residual_vq = ResidualVQ (
    dim = 256 ,
    num_quantizers = 8 ,      # specify number of quantizers
    codebook_size = 1024 ,    # codebook size
)

x = torch . randn ( 1 , 1024 , 256 )

quantized , indices , commit_loss = residual_vq ( x )
print ( quantized . shape , indices . shape , commit_loss . shape )
# (1, 1024, 256), (1, 1024, 8), (1, 8)

# if you need all the codes across the quantization layers, just pass return_all_codes = True

quantized , indices , commit_loss , all_codes = residual_vq ( x , return_all_codes = True )

# (8, 1, 1024, 256)

Además, este documento utiliza residual-VQ para construir el RQ-VAE, para generar imágenes de alta resolución con códigos más comprimidos.

Hacen dos modificaciones. El primero es compartir el libro de códigos en todos los cuantizadores. El segundo es probar estocásticamente los códigos en lugar de siempre tomar la coincidencia más cercana. Puede usar ambas funciones con dos argumentos de palabras clave adicionales.

 import torch
from vector_quantize_pytorch import ResidualVQ

residual_vq = ResidualVQ (
    dim = 256 ,
    num_quantizers = 8 ,
    codebook_size = 1024 ,
    stochastic_sample_codes = True ,
    sample_codebook_temp = 0.1 ,         # temperature for stochastically sampling codes, 0 would be equivalent to non-stochastic
    shared_codebook = True              # whether to share the codebooks for all quantizers or not
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = residual_vq ( x )

# (1, 1024, 256), (1, 1024, 8), (1, 8)

Un artículo reciente propone además hacer VQ residual en grupos de la dimensión de características, que muestra resultados equivalentes a Encodec mientras usa muchos menos libros de códigos. Puede usarlo importando GroupedResidualVQ

 import torch
from vector_quantize_pytorch import GroupedResidualVQ

residual_vq = GroupedResidualVQ (
    dim = 256 ,
    num_quantizers = 8 ,      # specify number of quantizers
    groups = 2 ,
    codebook_size = 1024 ,    # codebook size
)

x = torch . randn ( 1 , 1024 , 256 )

quantized , indices , commit_loss = residual_vq ( x )

# (1, 1024, 256), (2, 1, 1024, 8), (2, 1, 8)

Inicialización

El documento SoundStream propone que el libro de códigos debe ser inicializado por los centroides de Kmeans del primer lote. Puede activar fácilmente esta función con una bandera kmeans_init = True , para la clase VectorQuantize o ResidualVQ

 import torch
from vector_quantize_pytorch import ResidualVQ

residual_vq = ResidualVQ (
    dim = 256 ,
    codebook_size = 256 ,
    num_quantizers = 4 ,
    kmeans_init = True ,   # set to True
    kmeans_iters = 10     # number of kmeans iterations to calculate the centroids for the codebook on init
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = residual_vq ( x )

# (1, 1024, 256), (1, 1024, 4), (1, 4)

Cálculo de gradiente

Los VQ-VAS se entrenan tradicionalmente con el estimador directo (STE). Durante el pase hacia atrás, el gradiente fluye alrededor de la capa VQ en lugar de a través de ella. El papel de truco de rotación propone transformar el gradiente a través de la capa VQ para que el ángulo relativo y la magnitud entre el vector de entrada y la salida cuantificada se codifiquen en el gradiente. Puede habilitar o deshabilitar esta característica con rotation_trick=True/False en la clase VectorQuantize .

 from vector_quantize_pytorch import VectorQuantize

vq_layer = VectorQuantize (
    dim = 256 ,
    codebook_size = 256 ,
    rotation_trick = True ,   # Set to False to use the STE gradient estimator or True to use the rotation trick.
)

Aumento del uso del libro de códigos

Este repositorio contendrá algunas técnicas de varios documentos para combatir las entradas del libro de códigos "muerto", lo cual es un problema común al usar cuantizadores vectoriales.

Dimensión de libro de códigos inferior

El documento VQGAN mejorado propone que el libro de códigos se mantenga en una dimensión más baja. Los valores del codificador se proyectan antes de ser proyectado a alta dimensión después de la cuantización. Puede configurar esto con el hiperparámetro codebook_dim .

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_size = 256 ,
    codebook_dim = 16      # paper proposes setting this to 32 or as low as 8 to increase codebook usage
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = vq ( x )

# (1, 1024, 256), (1, 1024), (1,)

Similitud de coseno

El papel VQGAN mejorado también propone normalizar los códigos y los vectores codificados, lo que se reduce a usar similitud de coseno para la distancia. Afirman que aplicar los vectores en una esfera conduce a mejoras en el uso del código y la reconstrucción posterior. Puede activar esto configurando use_cosine_sim = True

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_size = 256 ,
    use_cosine_sim = True   # set this to True
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = vq ( x )

# (1, 1024, 256), (1, 1024), (1,)

Códigos rancios que vencen

Finalmente, el papel SoundStream tiene un esquema en el que reemplazan los códigos que tienen golpes por debajo de un cierto umbral con un vector seleccionado al azar desde el lote actual. Puede establecer este umbral con la palabra clave threshold_ema_dead_code .

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_size = 512 ,
    threshold_ema_dead_code = 2  # should actively replace any codes that have an exponential moving average cluster size less than 2
)

x = torch . randn ( 1 , 1024 , 256 )
quantized , indices , commit_loss = vq ( x )

# (1, 1024, 256), (1, 1024), (1,)

Pérdida de regularización ortogonal

VQ-VAE / VQ-Gan está ganando rápidamente popularidad. Un artículo reciente propone que al usar cuantización vectorial en las imágenes, hacer cumplir el libro de códigos como ortogonal conduce a la equivalencia de traducción de los códigos discretizados, lo que lleva a grandes mejoras en el texto aguas abajo a las tareas de generación de imágenes.

Puede usar esta función simplemente estableciendo el orthogonal_reg_weight para que sea mayor que 0 , en cuyo caso la regularización ortogonal se agregará a la pérdida auxiliaria que se supera por el módulo.

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_size = 256 ,
    accept_image_fmap = True ,                   # set this true to be able to pass in an image feature map
    orthogonal_reg_weight = 10 ,                 # in paper, they recommended a value of 10
    orthogonal_reg_max_codes = 128 ,             # this would randomly sample from the codebook for the orthogonal regularization loss, for limiting memory usage
    orthogonal_reg_active_codes_only = False    # set this to True if you have a very large codebook, and would only like to enforce the loss on the activated codes per batch
)

img_fmap = torch . randn ( 1 , 256 , 32 , 32 )
quantized , indices , loss = vq ( img_fmap ) # (1, 256, 32, 32), (1, 32, 32), (1,)

# loss now contains the orthogonal regularization loss with the weight as assigned

VQ de múltiples cabezas

Ha habido una serie de documentos que proponen variantes de representaciones latentes discretas con un enfoque de múltiples cabezas (múltiples códigos por característica). He decidido ofrecer una variante donde se usa el mismo libro de códigos para cuantificar vectorial en los tiempos head dimensión de entrada.

También puede utilizar un enfoque más probado (memcodes) de NWT Paper

 import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize (
    dim = 256 ,
    codebook_dim = 32 ,                  # a number of papers have shown smaller codebook dimension to be acceptable
    heads = 8 ,                          # number of heads to vector quantize, codebook shared across all heads
    separate_codebook_per_head = True ,  # whether to have a separate codebook per head. False would mean 1 shared codebook
    codebook_size = 8196 ,
    accept_image_fmap = True
)

img_fmap = torch . randn ( 1 , 256 , 32 , 32 )
quantized , indices , loss = vq ( img_fmap )

# (1, 256, 32, 32), (1, 32, 32, 8), (1,)

Cuantizador de proyección aleatoria

Este documento propuso primero usar un cuantizador de proyección aleatoria para modelado de voz enmascarado, donde las señales se proyectan con una matriz inicializada aleatoriamente y luego coinciden con un libro de códigos inicializado aleatorio. Por lo tanto, uno no necesita aprender el cuantizador. Esta técnica fue utilizada por el modelo de habla universal de Google para lograr SOTA para el modelado de voz a texto.

USM propone usar múltiples libros de códigos y el modelado de discursos enmascarado con un objetivo multi-Softmax. Puede hacerlo fácilmente configurando num_codebooks para que sea mayor que 1

 import torch
from vector_quantize_pytorch import RandomProjectionQuantizer

quantizer = RandomProjectionQuantizer (
    dim = 512 ,               # input dimensions
    num_codebooks = 16 ,      # in USM, they used up to 16 for 5% gain
    codebook_dim = 256 ,      # codebook dimension
    codebook_size = 1024     # codebook size
)

x = torch . randn ( 1 , 1024 , 512 )
indices = quantizer ( x )

# (1, 1024, 16)

Este repositorio también debe sincronizar automáticamente los libros de códigos en una configuración de múltiples procesos. Si de alguna manera no es así, abra un problema. Puede anular si se debe sincronizar los libros de códigos o no configurando sync_codebook = True | False

Sim VQ

Un nuevo artículo de ICLR 2025 propone un esquema donde el libro de códigos está congelado, y los códigos se generan implícitamente a través de una proyección lineal. Los autores afirman que esta configuración conduce a menos colapso de libros de códigos, así como una convergencia más fácil. He encontrado que esto funciona aún mejor cuando se combina con truco de rotación de Fifty et al., Y expandiendo la proyección lineal a una pequeña MLP de una capa. Puedes experimentar con él como así

Actualización: escuchar resultados mixtos

 import torch
from vector_quantize_pytorch import SimVQ

sim_vq = SimVQ (
    dim = 512 ,
    codebook_size = 1024 ,
    rotation_trick = True  # use rotation trick from Fifty et al.
)

x = torch . randn ( 1 , 1024 , 512 )
quantized , indices , commit_loss = sim_vq ( x )

assert x . shape == quantized . shape
assert torch . allclose ( quantized , sim_vq . indices_to_codes ( indices ), atol = 1e-6 )

Para el sabor residual, solo importe ResidualSimVQ en su lugar

 import torch
from vector_quantize_pytorch import ResidualSimVQ

residual_sim_vq = ResidualSimVQ (
    dim = 512 ,
    num_quantizers = 4 ,
    codebook_size = 1024 ,
    rotation_trick = True  # use rotation trick from Fifty et al.
)

x = torch . randn ( 1 , 1024 , 512 )
quantized , indices , commit_loss = residual_sim_vq ( x )

assert x . shape == quantized . shape
assert torch . allclose ( quantized , residual_sim_vq . get_output_from_indices ( indices ), atol = 1e-6 )

Cuantización escalar finita

	VQ	FSQ
Cuantificación	argmin_c \|\| ZC \|\|	Redonda (F (Z))
Gradiente	Estimación directa a través de la estimación (Ste)	Ste
Pérdidas auxiliares	Compromiso, libro de códigos, pérdida de entropía, ...	N / A
Engaños	EMA en el libro de códigos, división del libro de códigos, proyecciones, ...	N / A
Parámetros	Libro de códigos	N / A

Este trabajo de Google Deepmind tiene como objetivo simplificar enormemente la forma en que la cuantización vectorial se realiza para el modelado generativo, eliminando la necesidad de pérdidas de compromiso, la actualización de la EMA del libro de códigos, así como abordar los problemas con el colapso del libro de códigos o la utilización insuficiente. Simplemente redondean cada escalar hacia niveles discretos con gradientes directos; Los códigos se convierten en puntos uniformes en un hipercubo.

¡Gracias a @sekstini por portar por esta implementación en el tiempo de registro!

 import torch
from vector_quantize_pytorch import FSQ

quantizer = FSQ (
    levels = [ 8 , 5 , 5 , 5 ]
)

x = torch . randn ( 1 , 1024 , 4 ) # 4 since there are 4 levels
xhat , indices = quantizer ( x )

# (1, 1024, 4), (1, 1024)

assert torch . all ( xhat == quantizer . indices_to_codes ( indices ))

Un FSQ residual improvisado, para un intento de mejorar la codificación de audio.

El crédito va a @sekstini por instalar originalmente la idea aquí

 import torch
from vector_quantize_pytorch import ResidualFSQ

residual_fsq = ResidualFSQ (
    dim = 256 ,
    levels = [ 8 , 5 , 5 , 3 ],
    num_quantizers = 8
)

x = torch . randn ( 1 , 1024 , 256 )

residual_fsq . eval ()

quantized , indices = residual_fsq ( x )

# (1, 1024, 256), (1, 1024, 8)

quantized_out = residual_fsq . get_output_from_indices ( indices )

# (1, 1024, 256)

assert torch . all ( quantized == quantized_out )

Cuantización gratuita de búsqueda

El equipo de investigación detrás de Magvit ha publicado nuevos resultados de SOTA para el modelado de videos generativos. Un cambio central entre V1 y V2 incluye un nuevo tipo de cuantización, cuantización gratuita de búsqueda (LFQ), que elimina el libro de códigos e incrusta la búsqueda por completo.

Este artículo presenta un cuantizador LFQ simple de usar latentes binarios independientes. Existen otras implementaciones de LFQ. Sin embargo, el equipo muestra que Magvit-V2 con LFQ mejora significativamente en el punto de referencia de Imagenet. Las diferencias entre LFQ y FSQ de 2 niveles incluyen regularizaciones de entropía y pérdida de compromiso mantenida.

Desarrollar un método más avanzado de cuantificación LFQ sin libros de códigos-mira podría revolucionar el modelado generativo.

Puede usarlo simplemente de la siguiente manera. Será perrito en el puerto Magvit2 Pytorch

 import torch
from vector_quantize_pytorch import LFQ

# you can specify either dim or codebook_size
# if both specified, will be validated against each other

quantizer = LFQ (
    codebook_size = 65536 ,      # codebook size, must be a power of 2
    dim = 16 ,                   # this is the input feature dimension, defaults to log2(codebook_size) if not defined
    entropy_loss_weight = 0.1 ,  # how much weight to place on entropy loss
    diversity_gamma = 1.        # within entropy loss, how much weight to give to diversity of codes, taken from https://arxiv.org/abs/1911.05894
)

image_feats = torch . randn ( 1 , 16 , 32 , 32 )

quantized , indices , entropy_aux_loss = quantizer ( image_feats , inv_temperature = 100. )  # you may want to experiment with temperature

# (1, 16, 32, 32), (1, 32, 32), ()

assert ( quantized == quantizer . indices_to_codes ( indices )). all ()

También puede pasar las funciones de video como (batch, feat, time, height, width) o secuencias como (batch, seq, feat)

 import torch
from vector_quantize_pytorch import LFQ

quantizer = LFQ (
    codebook_size = 65536 ,
    dim = 16 ,
    entropy_loss_weight = 0.1 ,
    diversity_gamma = 1.
)

seq = torch . randn ( 1 , 32 , 16 )
quantized , * _ = quantizer ( seq )

assert seq . shape == quantized . shape

video_feats = torch . randn ( 1 , 16 , 10 , 32 , 32 )
quantized , * _ = quantizer ( video_feats )

assert video_feats . shape == quantized . shape

O admitir múltiples libros de códigos

 import torch
from vector_quantize_pytorch import LFQ

quantizer = LFQ (
    codebook_size = 4096 ,
    dim = 16 ,
    num_codebooks = 4  # 4 codebooks, total codebook dimension is log2(4096) * 4
)

image_feats = torch . randn ( 1 , 16 , 32 , 32 )

quantized , indices , entropy_aux_loss = quantizer ( image_feats )

# (1, 16, 32, 32), (1, 32, 32, 4), ()

assert image_feats . shape == quantized . shape
assert ( quantized == quantizer . indices_to_codes ( indices )). all ()

Un LFQ residual improvisado, para ver si puede conducir a una mejora para la compresión de audio.

 import torch
from vector_quantize_pytorch import ResidualLFQ

residual_lfq = ResidualLFQ (
    dim = 256 ,
    codebook_size = 256 ,
    num_quantizers = 8
)

x = torch . randn ( 1 , 1024 , 256 )

residual_lfq . eval ()

quantized , indices , commit_loss = residual_lfq ( x )

# (1, 1024, 256), (1, 1024, 8), (8)

quantized_out = residual_lfq . get_output_from_indices ( indices )

# (1, 1024, 256)

assert torch . all ( quantized == quantized_out )

Cuantificación latente

Desenglemento es esencial para el aprendizaje de representación, ya que promueve la interpretabilidad, la generalización, el aprendizaje mejorado y la robustez. Se alinea con el objetivo de capturar características significativas e independientes de los datos, facilitando el uso más efectivo de representaciones aprendidas en varias aplicaciones. Para un mejor desenredado, el desafío es desenredar las variaciones subyacentes en un conjunto de datos sin información de verdad de tierra explícita. Este trabajo introduce un sesgo inductivo clave destinado a codificar y decodificar dentro de un espacio latente organizado. La estrategia incorporada abarca la discretización del espacio latente al asignar vectores de código discretos a través de la utilización de un libro de códigos escalar individual para cada dimensión. Esta metodología permite que sus modelos superen los métodos previos robustos de manera efectiva.

Tenga en cuenta que tuvieron que usar una descomposición de peso muy alto para los resultados en este documento.

 import torch
from vector_quantize_pytorch import LatentQuantize

# you can specify either dim or codebook_size
# if both specified, will be validated against each other

quantizer = LatentQuantize (
    levels = [ 5 , 5 , 8 ],      # number of levels per codebook dimension
    dim = 16 ,                   # input dim
    commitment_loss_weight = 0.1 ,  
    quantization_loss_weight = 0.1 ,
)

image_feats = torch . randn ( 1 , 16 , 32 , 32 )

quantized , indices , loss = quantizer ( image_feats )

# (1, 16, 32, 32), (1, 32, 32), ()

assert image_feats . shape == quantized . shape
assert ( quantized == quantizer . indices_to_codes ( indices )). all ()

También puede pasar las funciones de video como (batch, feat, time, height, width) o secuencias como (batch, seq, feat)

 import torch
from vector_quantize_pytorch import LatentQuantize

quantizer = LatentQuantize (
    levels = [ 5 , 5 , 8 ],
    dim = 16 ,
    commitment_loss_weight = 0.1 ,  
    quantization_loss_weight = 0.1 ,
)

seq = torch . randn ( 1 , 32 , 16 )
quantized , * _ = quantizer ( seq )

# (1, 32, 16)

video_feats = torch . randn ( 1 , 16 , 10 , 32 , 32 )
quantized , * _ = quantizer ( video_feats )

# (1, 16, 10, 32, 32)

O admitir múltiples libros de códigos

 import torch
from vector_quantize_pytorch import LatentQuantize

model = LatentQuantize (
    levels = [ 4 , 8 , 16 ],
    dim = 9 ,
    num_codebooks = 3
)

input_tensor = torch . randn ( 2 , 3 , dim )
output_tensor , indices , loss = model ( input_tensor )

# (2, 3, 9), (2, 3, 3), ()

assert output_tensor . shape == input_tensor . shape
assert indices . shape == ( 2 , 3 , num_codebooks )
assert loss . item () >= 0

Citas

 @misc { oord2018neural ,
    title   = { Neural Discrete Representation Learning } ,
    author  = { Aaron van den Oord and Oriol Vinyals and Koray Kavukcuoglu } ,
    year    = { 2018 } ,
    eprint  = { 1711.00937 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.LG }
}

 @misc { zeghidour2021soundstream ,
    title   = { SoundStream: An End-to-End Neural Audio Codec } ,
    author  = { Neil Zeghidour and Alejandro Luebs and Ahmed Omran and Jan Skoglund and Marco Tagliasacchi } ,
    year    = { 2021 } ,
    eprint  = { 2107.03312 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.SD }
}

 @inproceedings { anonymous2022vectorquantized ,
    title   = { Vector-quantized Image Modeling with Improved {VQGAN} } ,
    author  = { Anonymous } ,
    booktitle = { Submitted to The Tenth International Conference on Learning Representations } ,
    year    = { 2022 } ,
    url     = { https://openreview.net/forum?id=pfNyExj7z2 } ,
    note    = { under review }
}

 @inproceedings { lee2022autoregressive ,
    title   = { Autoregressive Image Generation using Residual Quantization } ,
    author  = { Lee, Doyup and Kim, Chiheon and Kim, Saehoon and Cho, Minsu and Han, Wook-Shin } ,
    booktitle = { Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition } ,
    pages   = { 11523--11532 } ,
    year    = { 2022 }
}

 @article { Defossez2022HighFN ,
    title   = { High Fidelity Neural Audio Compression } ,
    author  = { Alexandre D'efossez and Jade Copet and Gabriel Synnaeve and Yossi Adi } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2210.13438 }
}

 @inproceedings { Chiu2022SelfsupervisedLW ,
    title   = { Self-supervised Learning with Random-projection Quantizer for Speech Recognition } ,
    author  = { Chung-Cheng Chiu and James Qin and Yu Zhang and Jiahui Yu and Yonghui Wu } ,
    booktitle = { International Conference on Machine Learning } ,
    year    = { 2022 }
}

 @inproceedings { Zhang2023GoogleUS ,
    title   = { Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages } ,
    author  = { Yu Zhang and Wei Han and James Qin and Yongqiang Wang and Ankur Bapna and Zhehuai Chen and Nanxin Chen and Bo Li and Vera Axelrod and Gary Wang and Zhong Meng and Ke Hu and Andrew Rosenberg and Rohit Prabhavalkar and Daniel S. Park and Parisa Haghani and Jason Riesa and Ginger Perng and Hagen Soltau and Trevor Strohman and Bhuvana Ramabhadran and Tara N. Sainath and Pedro J. Moreno and Chung-Cheng Chiu and Johan Schalkwyk and Franccoise Beaufays and Yonghui Wu } ,
    year    = { 2023 }
}

 @inproceedings { Shen2023NaturalSpeech2L ,
    title   = { NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers } ,
    author  = { Kai Shen and Zeqian Ju and Xu Tan and Yanqing Liu and Yichong Leng and Lei He and Tao Qin and Sheng Zhao and Jiang Bian } ,
    year    = { 2023 }
}

 @inproceedings { Yang2023HiFiCodecGV ,
    title   = { HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec } ,
    author  = { Dongchao Yang and Songxiang Liu and Rongjie Huang and Jinchuan Tian and Chao Weng and Yuexian Zou } ,
    year    = { 2023 }
}

 @inproceedings { huh2023improvedvqste ,
    title   = { Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks } ,
    author  = { Huh, Minyoung and Cheung, Brian and Agrawal, Pulkit and Isola, Phillip } ,
    booktitle = { International Conference on Machine Learning } ,
    year    = { 2023 } ,
    organization = { PMLR }
}

 @inproceedings { rogozhnikov2022einops ,
    title   = { Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation } ,
    author  = { Alex Rogozhnikov } ,
    booktitle = { International Conference on Learning Representations } ,
    year    = { 2022 } ,
    url     = { https://openreview.net/forum?id=oapKSVM2bcj }
}

 @misc { shin2021translationequivariant ,
    title   = { Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation } ,
    author  = { Woncheol Shin and Gyubok Lee and Jiyoung Lee and Joonseok Lee and Edward Choi } ,
    year    = { 2021 } ,
    eprint  = { 2112.00384 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { mentzer2023finite ,
    title   = { Finite Scalar Quantization: VQ-VAE Made Simple } ,
    author  = { Fabian Mentzer and David Minnen and Eirikur Agustsson and Michael Tschannen } ,
    year    = { 2023 } ,
    eprint  = { 2309.15505 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { yu2023language ,
    title   = { Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation } ,
    author  = { Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang } ,
    year    = { 2023 } ,
    eprint  = { 2310.05737 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @inproceedings { Zhao2024ImageAV ,
  title     = { Image and Video Tokenization with Binary Spherical Quantization } ,
  author    = { Yue Zhao and Yuanjun Xiong and Philipp Krahenbuhl } ,
  year      = { 2024 } ,
  url       = { https://api.semanticscholar.org/CorpusID:270380237 }
}

 @misc { hsu2023disentanglement ,
    title   = { Disentanglement via Latent Quantization } , 
    author  = { Kyle Hsu and Will Dorrell and James C. R. Whittington and Jiajun Wu and Chelsea Finn } ,
    year    = { 2023 } ,
    eprint  = { 2305.18378 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.LG }
}

 @inproceedings { Irie2023SelfOrganisingND ,
    title   = { Self-Organising Neural Discrete Representation Learning `a la Kohonen } ,
    author  = { Kazuki Irie and R'obert Csord'as and J{"u}rgen Schmidhuber } ,
    year    = { 2023 } ,
    url     = { https://api.semanticscholar.org/CorpusID:256901024 }
}

 @article { Huijben2024ResidualQW ,
    title   = { Residual Quantization with Implicit Neural Codebooks } ,
    author  = { Iris Huijben and Matthijs Douze and Matthew Muckley and Ruud van Sloun and Jakob Verbeek } ,
    journal = { ArXiv } ,
    year    = { 2024 } ,
    volume  = { abs/2401.14732 } ,
    url     = { https://api.semanticscholar.org/CorpusID:267301189 }
}

 @article { Fifty2024Restructuring ,
    title   = { Restructuring Vector Quantization with the Rotation Trick } ,
    author  = { Christopher Fifty, Ronald G. Junkins, Dennis Duan, Aniketh Iyengar, Jerry W. Liu, Ehsan Amid, Sebastian Thrun, Christopher Ré } ,
    journal = { ArXiv } ,
    year    = { 2024 } ,
    volume  = { abs/2410.06424 } ,
    url     = { https://api.semanticscholar.org/CorpusID:273229218 }
}

 @inproceedings { Zhu2024AddressingRC ,
    title   = { Addressing Representation Collapse in Vector Quantized Models with One Linear Layer } ,
    author  = { Yongxin Zhu and Bocheng Li and Yifei Xin and Linli Xu } ,
    year    = { 2024 } ,
    url     = { https://api.semanticscholar.org/CorpusID:273812459 }
}