CogView4, the latest open-source text-to-image model from Zhipu AI, has been officially released, marking another major breakthrough in AI image generation. CogView4 not only scales to 6 billion parameters but is also the first to offer full support for Chinese prompts and Chinese text-to-image generation, billed as "the first open-source model that can render Chinese characters inside images." This innovation gives Chinese content creators a powerful tool and significantly advances image generation technology in Chinese-language contexts.
The core highlight of CogView4 is its support for both Chinese and English prompts, particularly its handling of complex Chinese instructions. As the first open-source text-to-image model able to render Chinese characters in images, CogView4 fills a significant gap in the open-source field. The model also generates images at any aspect ratio and accepts prompts of arbitrary length, showing the flexibility and adaptability to cover a wide range of scenarios.
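For readers who want to try these capabilities, here is a minimal usage sketch. It assumes the CogView4Pipeline integration shipped in recent Hugging Face diffusers releases and the THUDM/CogView4-6B checkpoint; argument names follow the standard diffusers text-to-image convention.

```python
import torch
from diffusers import CogView4Pipeline

# Load the pipeline (assumes the CogView4 integration in recent diffusers releases)
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# A Chinese prompt asking the model to render Chinese characters in the image
prompt = "一张红色的海报，中间写有白色大字“开源”"

# Non-square output: width and height are free within the supported range
image = pipe(prompt=prompt, width=1280, height=720, num_inference_steps=50).images[0]
image.save("cogview4_demo.png")
```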
In terms of technical architecture, CogView4 has been fully upgraded: its text encoder is now GLM-4, supporting bilingual Chinese-English input and removing the previous limitation of open-source models that accepted English only. Trained on bilingual Chinese-English text-image pairs, CogView4 produces noticeably better results in Chinese contexts, with improved accuracy and fluency when processing Chinese text.
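As an illustration of what a bilingual text encoder means in practice, the sketch below runs a Chinese and an English prompt through a GLM-4 tokenizer, which covers both languages in a single vocabulary. The THUDM/glm-4-9b checkpoint name is an assumption chosen for demonstration; CogView4 bundles its own copy of the encoder.

```python
from transformers import AutoTokenizer

# One tokenizer, one vocabulary, both languages (checkpoint name is an assumption)
tok = AutoTokenizer.from_pretrained("THUDM/glm-4-9b", trust_remote_code=True)

for prompt in ["A watercolor painting of West Lake", "西湖的水彩画"]:
    ids = tok(prompt)["input_ids"]
    print(len(ids), ids[:8])  # token count and a peek at the ids
```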
In terms of text processing, CogView4 abandons the traditional fixed-length design in favor of a dynamic text length scheme. With descriptions averaging 200-300 tokens, this cuts redundancy by about 50% compared with a traditional fixed 512-token scheme and improves training efficiency by 5%-30%. The change not only makes better use of compute but also lets the model handle prompts of varying lengths more efficiently, further improving the quality and diversity of generated images.
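A back-of-envelope check of that redundancy figure, assuming captions average around 250 tokens:

```python
# With ~250-token captions, padding every caption to a fixed 512 tokens
# wastes roughly half the sequence; dynamic lengths avoid that padding.
avg_tokens = 250          # typical caption length per the article (200-300)
fixed_len = 512           # traditional fixed-length design

padding = fixed_len - avg_tokens
redundancy = padding / fixed_len
print(f"wasted positions per caption: {padding} ({redundancy:.0%})")  # ~51%
```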
CogView4 supports generating images at any resolution, thanks to several technical advances. The model is trained at mixed resolutions and combines two-dimensional rotary position embeddings (2D RoPE) with interpolated position representations, allowing it to adapt to different image sizes. In addition, a flow-matching diffusion formulation with a parameterized linear dynamic noise schedule further improves the quality and diversity of generated images, helping the model perform well in complex scenarios.
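The flow-matching idea can be sketched in a few lines: training pairs a linear interpolation between data and noise with a velocity regression target. The sketch below shows only the plain linear path; CogView4's parameterized dynamic noise schedule would additionally modulate how timesteps are sampled, and the details live in the project repository.

```python
import torch

def flow_matching_sample(x0: torch.Tensor, t: torch.Tensor):
    """Linear (rectified-flow style) noising used in flow-matching training:
    x_t is a straight-line interpolation between data x0 and Gaussian noise.
    A conceptual sketch, not CogView4's exact schedule."""
    noise = torch.randn_like(x0)
    t = t.view(-1, 1, 1, 1)               # broadcast over (C, H, W)
    x_t = (1.0 - t) * x0 + t * noise      # linear path from data to noise
    velocity = noise - x0                 # regression target for the network
    return x_t, velocity

# Usage: draw one random timestep per image in the batch
x0 = torch.randn(4, 3, 64, 64)            # stand-in for latent images
t = torch.rand(4)
x_t, v_target = flow_matching_sample(x0, t)
```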
CogView4 is trained in multiple stages: base-resolution training, then adaptation to general resolutions, then fine-tuning on high-quality data, and finally output optimization through human preference alignment. Throughout, the model retains the Share-param DiT architecture while introducing independent adaptive layer normalization for each modality, ensuring stability and consistency across tasks; a sketch of that idea follows. This staged training process lets CogView4 better meet user needs when generating images.
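The per-modality adaptive layer normalization can be sketched as follows. This is a hypothetical simplification for illustration, not CogView4's actual implementation, and all dimensions are invented: the transformer blocks share weights, but text and image token streams each get their own modulation head.

```python
import torch
import torch.nn as nn

class PerModalityAdaLN(nn.Module):
    """Sketch: shared DiT blocks, but each modality (text vs. image tokens)
    gets its own adaptive layer norm producing scale/shift from conditioning.
    A hypothetical simplification, not the CogView4 implementation."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Independent modulation heads per modality
        self.text_mod = nn.Linear(cond_dim, 2 * dim)
        self.image_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor, modality: str) -> torch.Tensor:
        mod = self.text_mod if modality == "text" else self.image_mod
        scale, shift = mod(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Usage: the same block normalizes text and image token streams differently
block = PerModalityAdaLN(dim=1024, cond_dim=256)
cond = torch.randn(2, 256)
y_text = block(torch.randn(2, 77, 1024), cond, "text")
y_img = block(torch.randn(2, 4096, 1024), cond, "image")
```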
Project address: https://github.com/THUDM/CogView4