
GitHub - DepthAnything/Video-Depth-Anything: [CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

A CVPR 2025 Highlight paper that tackles temporal flickering in video depth estimation through cross-frame attention, enabling consistent depth maps across thousands of frames without retraining for different video lengths.

Summary

• Introduces cross-frame attention architecture that processes videos in 8-frame chunks to eliminate flickering artifacts that plague single-image depth models applied to video
• Trained on 17.6M video clips (8.8B frames)—orders of magnitude larger than previous video depth datasets
• Handles arbitrary video lengths through sliding window approach: same model works on 10-frame clips or 10,000-frame movies
• Provides both inference and training code plus pre-trained models for immediate use
• Solves the production-blocking problem of temporal inconsistency in depth estimation for film, AR/VR, and autonomous systems

Video Depth Anything tackles the fundamental problem that existing depth estimation models, even state-of-the-art ones like Depth Anything v1/v2, produce flickering artifacts when applied to video because they process frames independently. The authors argue that video depth requires a video-native architecture, not just image models run frame by frame. Their solution uses cross-frame attention mechanisms that process videos in 8-frame chunks, maintaining temporal consistency by explicitly modeling inter-frame relationships.
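To make the idea concrete, here is a minimal NumPy sketch of cross-frame (temporal) attention over 8-frame chunks. This is an illustration of the mechanism the summary describes, not the repo's actual code: the function name, shapes, and single-head formulation are all assumptions, and a real implementation would use learned query/key/value projections.

```python
import numpy as np

def cross_frame_attention(features, chunk=8):
    """Toy single-head temporal attention (illustrative, not the repo's API).

    Within each chunk of `chunk` frames, every frame attends to all frames
    at the same spatial token, so per-frame features are blended across
    time instead of being computed independently.
    features: array of shape (T, N, D) = (frames, spatial tokens, channels)
    """
    T, N, D = features.shape
    out = np.empty_like(features)
    for start in range(0, T, chunk):
        x = features[start:start + chunk]            # (t, N, D)
        # Treat the temporal axis as the attention sequence, per token.
        q = k = v = np.transpose(x, (1, 0, 2))       # (N, t, D)
        scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(D)  # (N, t, t)
        scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[start:start + chunk] = np.transpose(w @ v, (1, 0, 2))
    return out
```

Note the behavior this buys: if the frames in a chunk are identical (a static scene), uniform attention returns them unchanged, while noisy per-frame features get averaged toward a temporally consistent estimate.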

The scale of their training data is unprecedented: 17.6M video clips containing 8.8B frames, compared to previous video depth datasets with only millions of frames. This massive dataset, combined with temporal consistency losses, enables the model to learn stable depth predictions across time. The architecture uses a sliding window approach that processes long videos in overlapping chunks, maintaining consistency at chunk boundaries. Critically, the model generalizes to arbitrary video lengths without retraining—you can feed it a 10-frame clip or a feature-length film.
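The overlapping-chunk inference pattern described above can be sketched in a few lines. This is a generic sliding-window scheme under assumed window/overlap sizes, not the authors' exact stitching strategy; averaging the overlap is the simplest way to keep chunk boundaries consistent.

```python
import numpy as np

def sliding_windows(n_frames, window=8, overlap=2):
    """Yield (start, end) spans covering n_frames with overlapping chunks.

    Window and overlap sizes are illustrative; overlap must be < window.
    """
    assert 0 <= overlap < window
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, n_frames)
        yield start, end
        if end == n_frames:
            break
        start += step

def stitch(depth_chunks, spans, n_frames):
    """Average per-frame depth where chunks overlap so boundaries agree."""
    acc = np.zeros(n_frames)
    cnt = np.zeros(n_frames)
    for d, (s, e) in zip(depth_chunks, spans):
        acc[s:e] += d
        cnt[s:e] += 1
    return acc / cnt
```

Because the window slides by `window - overlap`, the same code covers a 10-frame clip (one window) or a 10,000-frame movie (many windows) with no change to the model, which is the length-generalization property the paragraph describes.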

The practical implications are significant for any application requiring depth estimation over time. Film and video production can now generate consistent depth maps for effects work without manual cleanup of flickering. AR/VR applications get stable depth for video content. Autonomous systems can maintain consistent spatial understanding across continuous operation. The team provides complete inference code, training code, and pre-trained models, making this immediately usable for production applications that were previously blocked by temporal inconsistency issues.