On March 4, 2025, Beijing Zhipu Huazhang Technology Co., Ltd. officially released its latest open-source text-to-image model, CogView4. The model performed excellently on the DPG-Bench benchmark, taking the top overall score and setting a technical benchmark among current open-source text-to-image models. CogView4 not only follows the Apache 2.0 license but is also the first image generation model to be released under it, marking a new milestone in open-source image generation technology.
The core advantage of CogView4 is its strong complex semantic alignment and instruction-following capability. It can process bilingual Chinese and English input of any length and generate images at any resolution, giving CogView4 broad application prospects in creative fields such as advertising and short video. Technically, CogView4 adopts the GLM-4 encoder, which has bilingual capability; through bilingual Chinese-English image-text training, it supports prompts in either language, further improving the model's practicality and flexibility.

In terms of image generation, CogView4 supports prompts of any length and can generate images at any resolution, greatly improving creative freedom and training efficiency. The model uses two-dimensional rotary position embedding (2D RoPE) to model image position information, and supports generation at different resolutions through interpolated position encodings. In addition, CogView4 adopts flow matching for diffusion generative modeling, combined with a parameterized linear dynamic noise schedule, to adapt to the signal-to-noise ratio requirements of images at different resolutions and ensure the quality of the generated images.
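The 2D RoPE and position-interpolation ideas above can be sketched as follows. This is a minimal NumPy illustration of the general technique, not CogView4's actual implementation: the frequency base, channel split, and rescaling rule are assumptions. Half of the rotary channel pairs encode the patch's row index and half its column index; when generating at a larger grid than the base resolution, positions are rescaled so the angles stay inside the range seen during training.

```python
import numpy as np

def rope_2d_angles(h, w, dim, base_h=None, base_w=None):
    """Rotary angles for an h-by-w patch grid (illustrative sketch).

    Returns an (h, w, dim // 2) array of angles; in a real model these are
    applied as cos/sin rotations to query and key channel pairs. Passing
    base_h/base_w rescales positions into the trained range (interpolation).
    """
    assert dim % 4 == 0
    n = dim // 4
    inv_freq = 1.0 / (10000.0 ** (np.arange(n) / n))  # assumed base of 10000
    rows = np.arange(h, dtype=np.float64)
    cols = np.arange(w, dtype=np.float64)
    if base_h is not None:
        rows *= (base_h - 1) / max(h - 1, 1)  # squeeze rows into [0, base_h-1]
    if base_w is not None:
        cols *= (base_w - 1) / max(w - 1, 1)
    r, c = np.meshgrid(rows, cols, indexing="ij")
    # first half of channel pairs encodes the row, second half the column
    return np.concatenate([r[..., None] * inv_freq,
                           c[..., None] * inv_freq], axis=-1)

base = rope_2d_angles(4, 4, 16)
interp = rope_2d_angles(8, 8, 16, base_h=4, base_w=4)
```

With interpolation, the corners of the larger 8x8 grid receive exactly the same angles as the corners of the 4x4 base grid, which is what lets a model trained at one resolution be queried at another.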
In terms of architectural design, CogView4 continues the previous generation's share-param DiT architecture and designs independent adaptive LayerNorm layers for the text and image modalities to achieve efficient adaptation between them. The model adopts a multi-stage training strategy, including base-resolution training, general-resolution training, high-quality data fine-tuning, and human preference alignment training, ensuring that the generated images are not only aesthetically strong but also conform to human preferences.
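The per-modality adaptive LayerNorm idea can be illustrated with a small NumPy sketch. This is a generic adaLN as used in DiT-style blocks, not CogView4's actual code; the class name, shapes, and initialization are assumptions. The point is that text and image tokens can share the block's attention/MLP weights while each modality owns its own conditioning projection (the `W`, `b` below).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLN:
    """Adaptive LayerNorm (sketch): scale/shift predicted from a condition.

    In a share-param DiT, one AdaLN instance per modality adapts the shared
    transformer weights to text vs. image tokens.
    """
    def __init__(self, dim, cond_dim, rng):
        self.W = rng.standard_normal((cond_dim, 2 * dim)) * 0.02
        self.b = np.zeros(2 * dim)

    def __call__(self, x, cond):
        scale, shift = np.split(cond @ self.W + self.b, 2, axis=-1)
        return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
text_adaln = AdaLN(dim=8, cond_dim=4, rng=rng)   # text modality's own AdaLN
image_adaln = AdaLN(dim=8, cond_dim=4, rng=rng)  # image modality's own AdaLN
```

With a zero conditioning vector the module reduces to a plain LayerNorm, which makes the conditioning pathway easy to verify in isolation.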
CogView4 also breaks through the traditional fixed token-length limit, allowing a higher token cap and significantly reducing text-token redundancy during training. With an average training caption length of 200-300 tokens, compared with the traditional fixed-512-token scheme, CogView4 reduces token redundancy by about 50% and achieves a 5%-30% efficiency improvement in the model's progressive training stages, further optimizing training.
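The roughly 50% figure follows from simple padding arithmetic. This is an illustrative back-of-envelope calculation under an assumed average caption length of 250 tokens (the midpoint of the 200-300 range), not the release's exact accounting:

```python
# Padding waste when captions are forced to a fixed length.
fixed_len = 512        # traditional fixed token budget per caption
avg_caption = 250      # assumed midpoint of the 200-300 token range

redundant = fixed_len - avg_caption       # padded slots carrying no text
redundancy_ratio = redundant / fixed_len  # fraction of wasted token slots
print(f"{redundancy_ratio:.0%}")          # prints "51%", i.e. roughly half
```

Sizing the token budget to the actual caption length reclaims those padded slots, which is where the quoted redundancy reduction comes from.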
In addition, CogView4 is released under the Apache 2.0 license, and ecosystem support such as ControlNet and ComfyUI will be added gradually. A complete fine-tuning toolkit will also be launched soon, giving developers a more convenient experience. The open-source code repository is at https://github.com/THUDM/CogView4, and the model repositories are at https://huggingface.co/THUDM/CogView4-6B and https://modelscope.cn/models/ZhipuAI/CogView4-6B.