
GaVS: 3D-Grounded Video Stabilization

via Temporally-Consistent Local Reconstruction and Rendering

Zinuo You1     Stamatios Georgoulis2     Anpei Chen1,3     Siyu Tang1     Dengxin Dai2    
1ETH Zurich      2Huawei Research Center, Zurich      3University of Tübingen, Tübingen AI Center

SIGGRAPH 2025

TL;DR: We reformulate the video stabilization task with feed-forward 3DGS reconstruction, ensuring robustness to diverse motions, full-frame rendering, and high geometry consistency.

Abstract

Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues (e.g. geometric distortions, excessive cropping, poor generalization) that degrade the user experience.

To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives and finetune it at test time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally-consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping.

Our method is evaluated on a repurposed dataset, instilled with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of both conventional task metrics and our new geometry consistency metrics. Qualitatively, our method produces noticeably better results than the alternatives, as validated by a user study.

Method Overview


We stabilize a video in two phases.

1. Test-Time Finetuning. We finetune a pretrained reconstruction model that predicts local 3DGS scene reconstructions in a feed-forward manner. During test-time finetuning we update only the decoder, which predicts position offsets and the remaining 3DGS attributes. Local reconstructions are first extrapolated via video completion in the image domain, providing the initial reconstruction. Each local reconstruction is then supervised on the original images with a multi-view photometric loss, which compensates for dynamic objects, to improve quality and reduce distortions. Furthermore, local reconstructions within a dilated temporal window are regularized by encouraging similarity between their primitives, matched through dense correspondences from optical flow.

2. Inference. Each extrapolated reconstruction is rendered with its corresponding stabilized pose to obtain the stabilized full-frame video.
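For concreteness, here is a minimal PyTorch-style sketch of the two test-time objectives described above. The helpers render_gaussians and match_primitives, the data layout, and the loss weighting are our assumptions for illustration, not the released implementation.

import torch
import torch.nn.functional as F

# Hypothetical helpers, assumed for illustration (not the released code):
#   render_gaussians(gaussians, pose) -> rendered RGB image (3, H, W)
#   match_primitives(g_i, g_j, flow)  -> attribute tensors of flow-matched primitives

def finetune_step(decoder, feats, frames, poses, neighbors, flows, dyn_masks, opt):
    """One test-time finetuning step; only the decoder is updated."""
    # Decoder predicts position offsets plus the remaining 3DGS attributes.
    gaussians = [decoder(f) for f in feats]

    # Multi-view photometric loss with dynamic-object compensation: render each
    # local reconstruction into nearby original views and down-weight dynamic pixels.
    photo = 0.0
    for i, g in enumerate(gaussians):
        for j in neighbors[i]:
            pred = render_gaussians(g, poses[j])
            err = (pred - frames[j]).abs().mean(dim=0)
            photo = photo + (err * (1.0 - dyn_masks[j])).mean()

    # Cross-frame regularization: primitives matched through dense optical flow
    # within a dilated temporal window are encouraged to stay similar.
    reg = 0.0
    for (i, j), flow in flows.items():
        a, b = match_primitives(gaussians[i], gaussians[j], flow)
        reg = reg + F.mse_loss(a, b)

    loss = photo + 0.1 * reg  # the weight is illustrative, not from the paper
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)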

Introduction Video

Qualitative Results


Evaluation

Besides the conventional video stabilization metrics (distortion, stability, and cropping ratio), we further introduce sparse and dense geometry consistency metrics that evaluate how 3D space is distorted in the stabilized videos.
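To make the sparse metric concrete, the sketch below shows one plausible way to score geometry consistency: triangulate tracked points from the stabilized views and check their reprojection residuals. This reflects our reading of the idea, not the paper's exact formula; tracks, poses, and K are assumed inputs.

import numpy as np
import cv2

def sparse_geometry_consistency(tracks, poses, K):
    """Illustrative sparse geometry-consistency score (an assumption, not the
    paper's exact metric): triangulate each tracked point from two stabilized
    views and measure its mean reprojection error across all observations.

    tracks: dict point_id -> list of (frame_idx, (u, v)) observations
    poses:  list of 3x4 world-to-camera matrices of the stabilized video
    K:      3x3 camera intrinsics
    """
    errors = []
    for obs in tracks.values():
        if len(obs) < 2:
            continue
        (i, uv_i), (j, uv_j) = obs[0], obs[-1]
        P_i, P_j = K @ poses[i], K @ poses[j]
        X = cv2.triangulatePoints(P_i, P_j,
                                  np.float64(uv_i).reshape(2, 1),
                                  np.float64(uv_j).reshape(2, 1))
        X = X[:3] / X[3]  # homogeneous -> Euclidean 3D point
        for k, uv in obs:  # reproject into every observing view
            x = K @ (poses[k][:, :3] @ X + poses[k][:, 3:])
            errors.append(np.linalg.norm(x[:2, 0] / x[2, 0] - np.float64(uv)))
    return float(np.mean(errors))  # lower = 3D structure stays more consistent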

As shown in the comparison radar map below, GaVS stands out as the most comprehensive method, performing on par with or exceeding the top-performing approaches across all metrics. This is particularly evident in challenging scenarios characterized by intense motion and scene dynamics. The user study further confirms that GaVS is overwhelmingly preferred.

Stability Control

By explicitly modeling 3D camera motions, GaVS can control the stability of the output video. The following videos show how GaVS produces outputs with different levels of stability by adjusting the stability control parameter.
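The control itself can be pictured as blending the original camera path with a low-pass-filtered one. The sketch below is a toy realization under that assumption (Gaussian trajectory smoothing with a blend weight); GaVS's actual smoother may differ.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def stabilize_trajectory(positions, rotations, stability):
    """Toy stability control: blend the original camera path with a smoothed one.

    positions: (N, 3) camera centers
    rotations: (N, 3) axis-angle rotation vectors (smoothing these is only a
               rough approximation for large rotations)
    stability: float in [0, 1]; 0 keeps the input path, 1 is maximally smooth
    """
    sigma = 1.0 + 9.0 * stability  # illustrative mapping from parameter to filter width
    smooth_p = gaussian_filter1d(positions, sigma, axis=0)
    smooth_r = gaussian_filter1d(rotations, sigma, axis=0)
    p = (1.0 - stability) * positions + stability * smooth_p
    r = (1.0 - stability) * rotations + stability * smooth_r
    return p, r

Rendering each local reconstruction from the blended poses then yields output videos at the chosen stability level.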


Limitations

GaVS is a full-frame video stabilization method that works well for most videos. However, it may struggle under extreme camera motion or when the 3D camera tracking is inaccurate. The following videos show examples where GaVS fails to produce satisfactory results.


Acknowledgements

This work was partly done during Zinuo You's internship at Huawei Research Center, Switzerland. We thank the members of the Computer Vision Lab, Huawei ZRC and VLG, ETH Zurich for their support and discussions. Anpei Chen is supported by the ERC Starting Grant LEGO-3D (850533) and DFG EXC number 2064/1, project number 390727645.

BibTeX


@article{you2025gavs,
    title={GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering},
    author={You, Zinuo and Georgoulis, Stamatios and Chen, Anpei and Tang, Siyu and Dai, Dengxin},
    journal={arXiv preprint arXiv:2506.23957},
    year={2025}
}