Efficiently tracking 3D motion from monocular video has long been a hard problem in computer vision, especially for long sequences where pixel-level accuracy is required. Traditional methods are limited by compute and algorithmic complexity: they often track only a small number of key points and cannot capture the dense 3D motion of the complete scene. This article introduces DELTA, a new method that aims to solve this problem efficiently and reports strong results.

Moreover, existing methods are computationally expensive, which makes it hard to stay efficient on long videos. Long-term tracking is also affected by camera motion and object occlusion, both of which lead to tracking errors.
Current approaches to motion estimation in video each have strengths and weaknesses. Optical flow provides dense pixel tracking but is not robust in complex scenes, especially over long sequences.
Scene flow extends optical flow by estimating dense 3D motion from RGB-D data or point clouds, but it is still hard to apply efficiently to long sequences. Point tracking methods capture motion trajectories and combine spatial and temporal attention for smoother tracks, but their high computational cost makes dense tracking difficult. Reconstruction-based methods estimate motion with deformation fields, but they are impractical for real-time applications.

Recently, a research team from the University of Massachusetts Amherst, the MIT-IBM Watson AI Lab, and Snap Inc. proposed DELTA (Dense Efficient Long-range 3D Tracking for Any video), a method designed to efficiently track every pixel in 3D space. DELTA first tracks at low resolution using a spatio-temporal attention mechanism, then applies an attention-based upsampler to reach high-resolution accuracy. Its key innovations are an upsampler that preserves sharp motion boundaries, an efficient spatial attention architecture, and a log-depth representation that improves tracking performance.
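To make the idea of an attention-based upsampler more concrete, here is a minimal sketch in plain Python. It is not DELTA's implementation: the function names, the (x, y, log_depth) track layout, and the hand-set scores are all assumptions for illustration. In a real model the scores would come from a learned network. The point it shows is that softmax weights can concentrate on neighbours from the same object, whereas fixed bilinear weights would blur foreground and background motion together.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def upsample_track(coarse_tracks, scores):
    """Blend neighbouring coarse tracks into one fine-resolution track.

    coarse_tracks: list of (x, y, log_depth) tuples from nearby low-res cells
                   (hypothetical layout, chosen for this sketch).
    scores: one attention score per neighbour (hand-set here, learned in practice).
    """
    weights = softmax(scores)
    dims = len(coarse_tracks[0])
    return tuple(
        sum(w * track[d] for w, track in zip(weights, coarse_tracks))
        for d in range(dims)
    )

# Two neighbours lie on a foreground object, two on the background.
neighbours = [(10.0, 4.0, 0.5), (10.2, 4.1, 0.5), (30.0, 9.0, 2.0), (30.5, 9.2, 2.0)]

# High scores for the foreground neighbours keep the fine pixel on the foreground motion.
fine_track = upsample_track(neighbours, scores=[8.0, 8.0, -8.0, -8.0])

# Equal scores reduce to a plain average, which mixes both motions.
blurred_track = upsample_track(neighbours, scores=[0.0, 0.0, 0.0, 0.0])
```

With the selective scores, the fine track stays close to the two foreground neighbours; with equal scores it lands between the two objects, which is the boundary blurring that a learned upsampler is meant to avoid.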
DELTA achieves state-of-the-art results on the CVO and Kubric3D datasets, improving by more than 10% on metrics such as Average Jaccard (AJ) and the 3D average percent of points within delta (APD3D), and it also performs strongly on 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey. Unlike existing methods, DELTA performs dense 3D tracking at scale, running more than 8 times faster than previous methods while maintaining state-of-the-art accuracy.
Experiments show that DELTA performs well on 3D tracking tasks, exceeding previous methods in both speed and accuracy. DELTA is trained on more than 5,600 videos from the Kubric dataset, with a loss function that combines 2D coordinate, depth, and visibility losses.
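A loss that combines 2D coordinate, depth, and visibility terms could look roughly like the sketch below. The article does not give the exact formulation, so everything here is an assumption: the L1 distance for coordinates and depth, binary cross-entropy for visibility, the use of log-depth values, the tuple layout, and the weights `w_2d`, `w_depth`, and `w_vis`.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def tracking_loss(preds, targets, w_2d=1.0, w_depth=1.0, w_vis=1.0):
    """Average a weighted sum of three terms over all tracked points.

    preds:   list of (u, v, log_depth, visibility_probability)
    targets: list of (u, v, log_depth, visible) with visible in {0, 1}
    The weights are illustrative defaults, not values from the paper.
    """
    total = 0.0
    for (pu, pv, pd, pvis), (tu, tv, td, tvis) in zip(preds, targets):
        coord_term = abs(pu - tu) + abs(pv - tv)         # 2D position error
        depth_term = abs(pd - td)                        # depth error (log space)
        vis_term = binary_cross_entropy(pvis, tvis)      # visibility error
        total += w_2d * coord_term + w_depth * depth_term + w_vis * vis_term
    return total / len(preds)

targets = [(10.0, 5.0, 0.7, 1), (40.0, 8.0, 1.2, 0)]
good_preds = [(10.0, 5.0, 0.7, 1.0), (40.0, 8.0, 1.2, 0.0)]
poor_preds = [(12.0, 6.0, 0.9, 0.4), (43.0, 8.5, 1.0, 0.7)]

good_loss = tracking_loss(good_preds, targets)
poor_loss = tracking_loss(poor_preds, targets)
```

A perfect prediction gives a loss near zero, and each of the three terms raises the loss independently, so the weights control how much the model is pushed toward accurate positions, depths, or occlusion handling.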
In the benchmarks, DELTA achieved the highest scores for long-range 2D tracking on CVO and for dense 3D tracking on Kubric3D, and it completed these tasks much faster than other methods. DELTA's design choices, such as the log-depth representation, spatial attention, and the attention-based upsampler, significantly improve its accuracy and efficiency across a variety of tracking scenarios.
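One plausible reason a log-depth representation helps is that it measures relative rather than absolute depth error, so nearby and distant points are treated evenly. The sketch below illustrates only that property. The helper names and the `eps` guard are assumptions, not part of DELTA.

```python
import math

def to_log_depth(depth, eps=1e-6):
    """Map a positive metric depth to log space (eps guards against log(0))."""
    return math.log(max(depth, eps))

def from_log_depth(log_depth):
    """Invert the mapping back to metric depth."""
    return math.exp(log_depth)

# The same 10% relative error, once near the camera and once far away.
near_true, near_pred = 1.0, 1.1
far_true, far_pred = 100.0, 110.0

# In linear depth the far error is 100 times larger than the near error.
near_linear_err = abs(near_pred - near_true)
far_linear_err = abs(far_pred - far_true)

# In log depth both errors are equal, so neither range dominates.
near_log_err = abs(to_log_depth(near_pred) - to_log_depth(near_true))
far_log_err = abs(to_log_depth(far_pred) - to_log_depth(far_true))
```

In linear depth, distant points would dominate an error term simply because their values are larger; in log space the two errors above are identical.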
DELTA is an efficient method that tracks every pixel in a video frame, achieving higher accuracy and faster runtime in dense 2D and 3D tracking. It may struggle with points that remain occluded for a long time, and it performs best on short videos of no more than a few hundred frames. Its 3D tracking accuracy depends on the accuracy and temporal stability of the monocular depth estimator it uses, so progress in monocular depth estimation is expected to further improve the method.
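The dependence on depth quality can be illustrated with a standard pinhole camera model, in which a pixel and its depth are lifted to a 3D point. This is a generic sketch, not DELTA's pipeline: the `unproject` helper and the intrinsics `fx`, `fy`, `cx`, `cy` are assumed values chosen for the example.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical intrinsics for a 640x480 camera.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# The same pixel with the true depth and with a depth that is 5% too large.
true_point = unproject(420.0, 240.0, 4.0, fx, fy, cx, cy)
noisy_point = unproject(420.0, 240.0, 4.0 * 1.05, fx, fy, cx, cy)

# The 3D error equals 5% of the point's distance from the camera.
error_3d = distance(true_point, noisy_point)
range_to_point = distance(true_point, (0.0, 0.0, 0.0))
```

Even with a perfectly tracked pixel, a relative depth error translates directly into the same relative 3D error along the viewing ray, and depth that flickers from frame to frame makes the 3D track jitter accordingly.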
Project page: https://snap-research.github.io/DELTA/
Key points:
DELTA is a new method designed to efficiently track every pixel in a monocular video.
DELTA achieves leading results on the CVO and Kubric3D datasets while running more than 8 times faster than previous methods.
The method may struggle with points that are occluded for a long time, but it performs very well on short videos.
In summary, DELTA marks substantial progress in 3D motion tracking from monocular video, and its efficiency and accuracy open up new possibilities for future video processing applications. The method still needs further refinement to handle longer and more complex videos.