In today's digital age, AI-generated short videos have become commonplace, but they often lack depth and coherence and struggle to truly impress audiences. Long Context Tuning (LCT) was developed to address this problem. The technique gives AI video generation models the ability to direct multi-shot narrative videos, cutting between shots the way films and TV series do to build a more coherent and compelling story.

In the past, top AI video generation models such as Sora, Kling, and Gen-3 have been able to generate realistic single-shot videos up to a minute long. However, these models still face major challenges when generating multi-shot narrative videos. A movie scene is typically composed of several distinct single-shot clips that capture the same coherent event, which requires the model to maintain a high degree of consistency in visual appearance and temporal dynamics.
Take the classic scene in Titanic where Jack and Rose meet on the deck, which contains four main shots: a close-up of Jack looking back, a medium shot of Rose talking, a wide shot of Rose walking toward Jack, and a close-up of Jack embracing Rose from behind. Generating such a scene requires not only consistency in character appearance, background, lighting, and color tone, but also a steady rhythm of character motion and smooth camera movement so that the narrative flows naturally.
To bridge the gap between single-shot generation and multi-shot narrative, researchers have proposed a variety of approaches, but most have limitations. Some methods rely on injecting key visual elements to enforce visual consistency across shots, yet struggle to control more abstract factors such as lighting and color tone. Other methods first produce a coherent set of keyframes and then use an image-to-video (I2V) model to synthesize each shot independently, which makes it hard to guarantee temporal consistency between shots; the sparsity of keyframes also limits how effectively they can condition the generation.
LCT was designed precisely to solve these problems. It expands the context window of a single-shot video diffusion model so that it can learn inter-shot coherence directly from scene-level video data. The core innovations of LCT are an expanded full attention mechanism, interleaved 3D position embeddings, and an asynchronous noise strategy. Together, these designs let the model attend to all visual and textual information of the entire scene at once while generating video, so it can better understand and maintain cross-shot dependencies.
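To make these ideas concrete, here is a minimal sketch, assuming a PyTorch-style setup, of how interleaved 3D position indices and per-shot (asynchronous) noise levels might be assigned. The shapes, function names, and temporal-offset scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def scene_position_indices(num_shots, frames_per_shot, height, width):
    """Return (t, h, w) indices for every video token in a scene.

    Each shot keeps its own spatial grid; the temporal index continues
    across shots, so all shots share one coordinate frame while remaining
    distinguishable to the attention layers.
    """
    positions = []
    t_offset = 0
    for _ in range(num_shots):
        t = torch.arange(t_offset, t_offset + frames_per_shot)
        h = torch.arange(height)
        w = torch.arange(width)
        grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)
        positions.append(grid.reshape(-1, 3))   # (frames * H * W, 3)
        t_offset += frames_per_shot
    return torch.cat(positions, dim=0)           # (total_tokens, 3)

def asynchronous_timesteps(num_shots, num_train_steps=1000):
    """Sample an independent diffusion noise level for every shot."""
    return torch.randint(0, num_train_steps, (num_shots,))

pos = scene_position_indices(num_shots=4, frames_per_shot=16, height=8, width=8)
t_per_shot = asynchronous_timesteps(num_shots=4)
```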
Experimental results show that a single-shot model fine-tuned with LCT performs well at generating coherent multi-shot scenes and exhibits some surprising new capabilities. For example, it can compositionally generate a scene from a given character identity and environment image, even though it was never specifically trained for that task. In addition, the LCT model supports autoregressive shot extension, whether extending a single continuous shot or extending across multiple shots with shot transitions. This is especially useful for long-video creation, because it breaks long-video generation into scene-level segments that users can revise interactively.
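As a rough illustration of autoregressive shot extension, the following sketch assumes a hypothetical `generate_next_shot` interface; it is not an API from the paper or from any released library, just a picture of the shot-by-shot conditioning loop described above.

```python
# Illustrative pseudocode only: `model.generate_next_shot` is a hypothetical
# interface, not a real API.

def extend_scene(model, global_prompt, shot_prompts, history=None):
    """Generate a scene shot by shot, conditioning each new shot on the
    shots produced so far so that appearance and motion stay coherent."""
    history = list(history or [])
    for prompt in shot_prompts:
        new_shot = model.generate_next_shot(
            global_prompt=global_prompt,   # characters, environment, story summary
            shot_prompt=prompt,            # the event this shot should depict
            context_shots=history,         # previously generated shots as context
        )
        history.append(new_shot)
    return history
```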
Going further, the researchers found that after LCT, a model with bidirectional attention can be fine-tuned again to use context-causal attention. This attention mechanism remains bidirectional within each shot, but across shots, information flows only from earlier shots to later ones. This one-way information flow allows the KV cache (a mechanism that reuses previously computed key/value states) to be exploited during autoregressive generation, significantly reducing computational overhead.
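A minimal sketch of such a context-causal mask is shown below, assuming one flat token sequence per scene; the token counts are illustrative. Tokens attend bidirectionally inside their own shot but only to earlier shots across shot boundaries, which is what makes KV caching possible.

```python
import torch

def context_causal_mask(tokens_per_shot):
    """Boolean attention mask (True = may attend) over all scene tokens:
    bidirectional inside a shot, causal across shots."""
    total = sum(tokens_per_shot)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start_q = 0
    for i, n_q in enumerate(tokens_per_shot):
        start_k = 0
        for j, n_k in enumerate(tokens_per_shot):
            if j <= i:  # same shot or any earlier shot
                mask[start_q:start_q + n_q, start_k:start_k + n_k] = True
            start_k += n_k
        start_q += n_q
    return mask

# Three shots with 4 tokens each. Because later shots never feed information
# back into earlier ones, the key/value states of finished shots can be
# cached (KV cache) while the next shot is generated autoregressively.
mask = context_causal_mask([4, 4, 4])
```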
As shown in Figure 1, LCT can be applied directly to short-film production to achieve scene-level video generation. Even more exciting, it also gives rise to a range of emergent capabilities, such as interactive multi-shot direction, single-shot extension, and zero-shot compositional generation, even though the model was never trained for these specific tasks. Figure 2 shows an example of scene-level video data, which contains a global prompt (describing the characters, environment, and story summary) together with a specific event description for each shot.
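The scene-level data layout described for Figure 2 might be organized roughly as in the sketch below; the field names are illustrative assumptions rather than the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    caption: str        # the event depicted in this shot
    video_path: str     # path to the clip for this shot

@dataclass
class SceneExample:
    global_prompt: str  # characters, environment, and story summary
    shots: List[Shot]   # ordered shots that make up the scene

example = SceneExample(
    global_prompt="Two characters meet on a ship's deck at sunset ...",
    shots=[
        Shot(caption="Close-up: the man turns to look back", video_path="shot_01.mp4"),
        Shot(caption="Medium shot: the woman speaks to him", video_path="shot_02.mp4"),
    ],
)
```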
In summary, Long Context Tuning (LCT) opens a new path toward more practical visual content creation by extending the context window of a single-shot video diffusion model so that it learns scene-level coherence directly from data. The technique not only improves the narrative ability and coherence of AI-generated videos, but also offers new ideas for future long-video generation and interactive video editing. There is good reason to believe that video creation will become more intelligent and creative thanks to advances in technologies such as LCT.
Project address: https://top.aibase.com/tool/zhangshangxiawentiaoyouulct
Paper address: https://arxiv.org/pdf/2503.10589