In recent years, large language models (LLMs) have made breakthrough progress in artificial intelligence, particularly in multimodal fusion. A joint team from Huazhong University of Science and Technology, ByteDance, and the University of Hong Kong recently proposed Liquid, an innovative multimodal generation framework that aims to overcome the limitations of current mainstream multimodal models in visual processing. This work marks a further step for artificial intelligence in the multimodal field.
Traditional multimodal models often rely on complex external vision modules, which not only increase system complexity but also limit scalability and flexibility. Liquid's innovation is to adopt VQGAN as an image tokenizer, removing the dependence on external visual components. By encoding images into discrete visual tokens, Liquid lets the model share a single vocabulary with text tokens, achieving "native" visual understanding and generation capabilities. This design greatly simplifies the model architecture while improving its scalability.
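To make the "shared vocabulary" idea concrete, the sketch below shows how a VQGAN-style quantizer can turn image patches into discrete codebook indices and shift them into the ID range directly after the text vocabulary, so one embedding table serves both modalities. This is a minimal illustration under assumed sizes; the names, dimensions, and the random stand-in codebook are hypothetical and are not Liquid's actual implementation.

```python
# Sketch: VQGAN-style patch quantization into a vocabulary shared with text tokens.
# All sizes and names below are illustrative placeholders.
import numpy as np

TEXT_VOCAB_SIZE = 32000      # hypothetical size of the base text vocabulary
CODEBOOK_SIZE = 8192         # hypothetical number of visual codebook entries
PATCH = 8                    # patch side length; flattened patch dim = 8*8*3

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH * PATCH * 3))  # stands in for a trained codebook


def quantize_patches(patches: np.ndarray) -> np.ndarray:
    """Map each flattened patch to the index of its nearest codebook vector."""
    dists = (
        (patches ** 2).sum(axis=1, keepdims=True)
        - 2.0 * patches @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)


def image_to_shared_tokens(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image into patches, quantize them, and shift the indices
    into the region of the shared vocabulary reserved for visual tokens."""
    h, w, c = image.shape
    patches = (
        image[: h - h % PATCH, : w - w % PATCH]
        .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, PATCH * PATCH * c)
    )
    codes = quantize_patches(patches)
    return codes + TEXT_VOCAB_SIZE  # visual token IDs follow the text token IDs


if __name__ == "__main__":
    fake_image = rng.random((64, 64, 3))
    visual_tokens = image_to_shared_tokens(fake_image)
    # One embedding table of size TEXT_VOCAB_SIZE + CODEBOOK_SIZE now covers
    # both text tokens and these visual tokens.
    print(visual_tokens.shape, visual_tokens.min(), visual_tokens.max())
```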
The study found that Liquid not only significantly reduces training costs but also reveals a scaling law linking multimodal capability to LLM size. The research team ran experiments on LLMs of different sizes (from 0.5B to 32B parameters). The results showed that as model scale grows, performance on visual generation tasks and the quality of the generated images follow a scaling pattern consistent with that of language tasks. More strikingly, visual understanding and generation tasks reinforce each other: the two can be jointly optimized through a shared representation space. This finding provides an important theoretical basis for future multimodal model design.
Liquid's design embodies minimalism, treating images and text on an equal footing within a unified processing framework. During training, the research team used 30M text samples and 30M image-text pairs to lay the foundation for the model's multimodal capability. The final experiments show that Liquid performs well on multimodal understanding, image generation, and text-only tasks, and the semantic consistency between generated images and their prompts is significantly higher than that of other autoregressive models. These results demonstrate Liquid's strong potential in practical applications.
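The following sketch illustrates what "treating images and text equally in a unified framework" can look like at the data level: both text-to-image generation and image-to-text understanding samples become one flat token sequence over the shared vocabulary, trained with the same next-token objective. The special-token IDs and example values are assumptions for illustration, not values from the paper.

```python
# Sketch: one autoregressive next-token objective over mixed text/image sequences.
# Special-token IDs and sample values are illustrative placeholders.
import numpy as np

BOI, EOI = 40192, 40193   # hypothetical "begin/end of image" marker tokens


def build_generation_example(caption_ids, visual_ids):
    """Text-to-image sample: the model learns to predict the visual tokens."""
    return np.concatenate([caption_ids, [BOI], visual_ids, [EOI]])


def build_understanding_example(visual_ids, answer_ids):
    """Image-to-text sample: the model learns to predict the answer tokens."""
    return np.concatenate([[BOI], visual_ids, [EOI], answer_ids])


def next_token_targets(sequence):
    """Both sample types share the same objective: predict token t+1 from token t."""
    return sequence[:-1], sequence[1:]


if __name__ == "__main__":
    caption = np.array([17, 902, 5531])        # placeholder text token IDs
    visual = np.array([32010, 33842, 36007])   # placeholder visual token IDs
    inputs, targets = next_token_targets(build_generation_example(caption, visual))
    print(inputs)
    print(targets)
```

Because both sample types reduce to the same loss over one vocabulary, improvements learned from generation data can transfer to understanding and vice versa, which is consistent with the bidirectional reinforcement the authors report.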
Liquid offers a new approach to the architectural design of general-purpose multimodal intelligence, suggesting that artificial intelligence may evolve toward more efficient and flexible multimodal fusion. Beyond advancing research in the multimodal field, this work opens up new possibilities for applying artificial intelligence in more practical scenarios.
Paper link: https://arxiv.org/pdf/2412.04332