Recently, the Tongyi team announced the open-sourcing of its latest Tongyi Wanxiang video generation model, Wan2.1, which has attracted widespread attention in the field of artificial intelligence. Wan2.1 is an AI model focused on high-quality video generation. With its outstanding performance in handling complex motion, reproducing real-world physics, improving cinematic texture, and following instructions, Wan2.1 has become a tool of choice for creators, developers, and enterprise users embracing the AI era.

In the authoritative VBench evaluation, Tongyi Wanxiang Wan2.1 topped the leaderboard with an overall score of 86.22%, clearly ahead of other well-known domestic and international video generation models such as Sora, Minimax, Luma, Gen3, and Pika. This result builds on Wan2.1's mainstream DiT architecture and a Flow Matching paradigm with a linear noise trajectory, combined with a series of technical innovations that markedly advance its generation capability. Among them, the self-developed, efficient 3D causal VAE module achieves 256x lossless compression of the video latent space, supports efficient encoding and decoding of videos of arbitrary length through a feature-caching mechanism, and reduces memory usage during inference by 29%. In addition, on a single A800 GPU the model reconstructs video 2.5 times faster than existing state-of-the-art methods, a significant performance advantage.
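To make the Flow Matching paradigm with a linear noise trajectory concrete, here is a minimal, hedged sketch of the training objective (a rectified-flow-style loss written in PyTorch; the model interface, latent shapes, and conditioning are illustrative assumptions, not Wan2.1's actual code):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """One linear-trajectory flow matching training step (illustrative).

    x0:   clean video latents, e.g. (B, C, T, H, W) from a 3D causal VAE
    cond: conditioning such as text embeddings, passed to the transformer
    """
    b = x0.shape[0]
    noise = torch.randn_like(x0)                 # endpoint x1 ~ N(0, I)
    t = torch.rand(b, device=x0.device)          # uniform timesteps in [0, 1]
    t_ = t.view(b, *([1] * (x0.dim() - 1)))

    # Linear noise trajectory: a straight interpolation from data to noise.
    xt = (1.0 - t_) * x0 + t_ * noise

    # Along a straight path the target velocity is constant: x1 - x0.
    target_v = noise - x0

    # The transformer predicts the velocity field; train with a simple MSE.
    pred_v = model(xt, t, cond)
    return F.mse_loss(pred_v, target_v)
```

Sampling then integrates the learned velocity field from pure noise back toward data along the same straight path.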
Wan2.1's video Diffusion Transformer architecture uses a Full Attention mechanism to model long-range spatio-temporal dependencies effectively, producing high-quality, spatio-temporally consistent videos. Its training strategy follows a six-stage progressive schedule that moves from pre-training on low-resolution image data to training on high-resolution video data, and ends with fine-tuning on high-quality annotated data, ensuring strong performance across resolutions and in complex scenes. For data processing, Wan2.1 designed a four-step data-cleaning pipeline focused on basic attributes, visual quality, and motion quality, filtering high-quality, diverse samples out of noisy raw datasets to support effective training.
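As a rough, hedged illustration of Full Attention over a video (treating all space-time positions as one joint token sequence rather than attending over space and time separately), the sketch below uses placeholder dimensions and a simplified block, not Wan2.1's real architecture:

```python
import torch
import torch.nn as nn

class FullSpatioTemporalAttention(nn.Module):
    """Joint self-attention over all (time x height x width) tokens."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, D) patchified video latents (layout assumed here)
        b, t, h, w, d = x.shape
        tokens = x.reshape(b, t * h * w, d)   # flatten space and time together
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)             # every token attends to every other
        return (tokens + y).reshape(b, t, h, w, d)

# Example: 8 latent frames on a 16x16 latent grid with 1024-dim tokens.
x = torch.randn(1, 8, 16, 16, 1024)
print(FullSpatioTemporalAttention()(x).shape)  # torch.Size([1, 8, 16, 16, 1024])
```

Because every token can attend to every other token across frames, motion and appearance can stay consistent over long clips, at the cost of attention that scales with the full space-time sequence length.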

Wan2.1 adopts several strategies to optimize training and inference efficiency. During training, different distributed parallelism strategies are applied to the text and video encoding modules and to the DiT module, with efficient strategy switching to avoid redundant computation. For GPU memory, a layered memory-optimization strategy is combined with PyTorch's memory management mechanism to mitigate memory fragmentation. At the inference stage, multi-GPU distributed acceleration combines FSDP with 2D context parallelism (CP), and quantization further improves performance.
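A hedged sketch of the multi-GPU inference side is shown below, covering only the FSDP half of the FSDP + 2D CP combination (the context-parallel split of the token sequence and the quantization step are omitted, and the launch details are assumptions rather than Wan2.1's actual setup):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_for_inference(dit: torch.nn.Module) -> FSDP:
    """Shard DiT parameters across GPUs so each rank holds only a slice.

    Intended to run under a launcher such as:
        torchrun --nproc_per_node=4 infer.py
    """
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # FSDP keeps each parameter shard on one GPU and gathers weights
    # layer by layer during the forward pass, cutting per-GPU memory.
    sharded = FSDP(dit.to(rank), device_id=rank)
    sharded.eval()
    return sharded
```

In this combination, FSDP reduces per-GPU weight memory while 2D context parallelism splits the long video token sequence across devices, so the two techniques address memory and latency along different axes.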
At present, Tongyi Wanxiang Wan2.1 has been open-sourced on GitHub, Hugging Face, the ModelScope community, and other platforms, and supports a variety of mainstream frameworks. Developers and researchers can try it out quickly through Gradio, or use xDiT parallel inference acceleration to improve efficiency. Integration with Diffusers and ComfyUI is also underway to simplify one-click inference and deployment and lower the development barrier, giving users flexible options so that both rapid prototyping and efficient production deployment are easy to achieve (a minimal download sketch follows the links below).
GitHub: https://github.com/Wan-Video
Hugging Face: https://huggingface.co/Wan-AI
Online experience: https://tongyi.aliyun.com/wanxiang
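For a quick local start, the weights can be pulled with huggingface_hub before running the repository's generation scripts; the sketch below is illustrative, and the repository id Wan-AI/Wan2.1-T2V-14B is an assumption based on the Hugging Face organization linked above and should be checked against the model cards:

```python
from huggingface_hub import snapshot_download

# Download a checkpoint locally; the repo_id is assumed from the
# Wan-AI organization page and should be verified before use.
ckpt_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-14B",
    local_dir="./Wan2.1-T2V-14B",
)
print("Checkpoint downloaded to:", ckpt_dir)
```

The downloaded directory can then be passed to the generation scripts in the GitHub repository, or reused once the Diffusers and ComfyUI integrations mentioned above become available.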