Recently, HPC-AI Tech announced the launch of Open-Sora 2.0, a video AI system that achieves commercial-grade quality at roughly one-tenth of the typical training cost. The advance suggests a possible paradigm shift in the resource-intensive video AI field, comparable to the efficiency gains seen in language models.
While existing high-quality video generation systems such as Movie Gen and Step-Video-T2V can require millions of dollars to train, Open-Sora 2.0's training spend was only about $200,000. Despite the substantial cost reduction, testing shows that its output quality is comparable to established commercial systems such as Runway Gen-3 Alpha and HunyuanVideo. The system was trained on 224 Nvidia H200 GPUs.
Prompt: "Two women sit on a beige sofa in a warm, comfortable room with brick walls in the background. They chat happily, smiling and raising glasses of red wine in a toast, captured in an intimate medium shot." | Video: HPC-AI Tech
Open-Sora 2.0 achieves its efficiency through a three-stage training process that starts with low-resolution video and gradually refines to higher resolutions. Integrating pre-trained image models such as Flux further reduces resource use. At its core is the Video DC-AE autoencoder, which delivers much higher compression rates than traditional methods. This translates into 5.2x faster training and more than 10x faster video generation. While the higher compression rate slightly reduces output detail, it greatly speeds up video creation.
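The progressive-resolution idea can be sketched as a staged training schedule. The resolutions, step counts, and function names below are illustrative assumptions, not the project's actual configuration; the point is simply that most optimization steps happen at the cheapest resolution.

```python
# Hypothetical three-stage schedule (values are illustrative, not
# Open-Sora 2.0's real configuration): train mostly at low resolution,
# then refine at progressively higher resolutions.
STAGES = [
    {"resolution": 256, "steps": 70_000},  # bulk of training, cheap per step
    {"resolution": 512, "steps": 20_000},  # mid-resolution refinement
    {"resolution": 768, "steps": 10_000},  # final high-resolution polish
]

def training_plan(stages):
    """Return (resolution, steps) tuples in the order the stages run."""
    return [(s["resolution"], s["steps"]) for s in stages]

plan = training_plan(STAGES)
print(plan)  # [(256, 70000), (512, 20000), (768, 10000)]
```

Because per-step cost grows steeply with resolution, front-loading training at low resolution is where most of the claimed savings would come from.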
Prompt: "A tomato surfs on a lettuce leaf down a waterfall of ranch dressing; exaggerated surfing moves and smooth wave effects highlight the fun of the 3D animation." | Video: HPC-AI Tech
The open-source system can generate videos from text descriptions or single images, and lets users control the intensity of motion in the generated clips through a motion-score setting. Examples provided by HPC-AI Tech showcase a variety of scenarios, from realistic dialogue to whimsical animations.
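As a rough sketch of how such a motion-score control might surface in a generation request (function, parameter names, and value range here are hypothetical; the project's GitHub repository documents the real interface):

```python
def build_generation_request(prompt, image=None, motion_score=0.5,
                             resolution=(768, 768), num_frames=128):
    """Assemble a hypothetical request for text- or image-to-video generation.

    motion_score controls motion intensity: 0.0 is nearly static,
    1.0 is maximal movement. All names here are illustrative only.
    """
    if not 0.0 <= motion_score <= 1.0:
        raise ValueError("motion_score must be in [0, 1]")
    request = {
        "prompt": prompt,
        "motion_score": motion_score,
        "resolution": resolution,
        "num_frames": num_frames,
    }
    if image is not None:  # image-to-video: condition on a start frame
        request["reference_image"] = image
    return request

# Text-to-video with high motion intensity:
req = build_generation_request("A tomato surfs a lettuce leaf", motion_score=0.8)
```

Omitting the image yields plain text-to-video; passing one switches to image-conditioned generation, mirroring the two input modes the article describes.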
However, Open-Sora 2.0 is currently limited in resolution (768x768 pixels) and maximum video length (5 seconds, or 128 frames), falling short of leading models such as OpenAI's Sora. Nevertheless, its performance in key areas such as visual quality, prompt accuracy, and motion handling approaches commercial standards. Notably, Open-Sora 2.0's VBench score now trails OpenAI's Sora by only 0.69%, a significant improvement over the previous version's 4.52% gap.
Prompt: "A bunch of anthropomorphic mushrooms hold a disco party in a dark magical forest, accompanied by flashing neon lights and exaggerated dance steps; their smooth textures and reflective surfaces emphasize the funny 3D look." | Video: HPC-AI Tech
Open-Sora 2.0's cost-effective strategy echoes the "DeepSeek moment" in language models, when improved training methods enabled open-source systems to achieve commercial-grade performance at a fraction of commercial costs. This development could put downward pressure on prices in the video AI field, where services currently charge by the second because of high computing demands.
Training cost comparison: Open-Sora 2.0 costs about $200,000, while Movie Gen costs $2.5 million and Step-Video-T2V costs $1 million. | Image: HPC-AI Tech
Despite this progress, the performance gap between open-source and commercial video AI remains larger than in language models, highlighting the field's ongoing technical challenges. Open-Sora 2.0 is now available as an open-source project on GitHub.