MoGaF: Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

1 Pohang University of Science and Technology (POSTECH), 2 RLWRLD
(* equal contribution)
MoGaF Teaser

MoGaF enables long-term, stable scene forecasting via motion-aware Gaussian grouping.

Abstract

Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability.

Method Overview

MoGaF pipeline.
MoGaF forecasts future dynamic scenes by explicitly structuring the 4D Gaussian space into motion-consistent units before extrapolation. Instead of treating each Gaussian independently, our pipeline first discovers object-level motion groups, then enforces group-level geometric constraints, and finally predicts future trajectories in that structured motion space.
  • Motion-aware Gaussian Grouping. Starting from optimized 4D Gaussians, MoGaF groups primitives into object-level motion units and classifies each group as rigid or non-rigid. The grouping alternates between keyframe seeding and feature-space region growing, which keeps the assignment robust under occlusion and fast motion (see the first sketch after this list).
  • Group-wise Motion Optimization. For rigid groups, MoGaF enforces a shared SE(3) motion with rigid anchoring; for non-rigid groups, it applies local motion smoothness over neighboring Gaussians. These constraints reduce per-Gaussian drift and improve temporal coherence (see the second sketch after this list).
  • Group-wise Future Forecasting. MoGaF trains a lightweight transformer forecaster per motion group to extrapolate future Gaussian trajectories. Masked motion modeling during training keeps the model stable over longer forecasting horizons (a training sketch appears in the masking ablation below).
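To make the grouping step concrete, here is a minimal region-growing sketch in Python. It is an illustration under assumptions rather than the paper's implementation: we assume each Gaussian exposes a canonical center and a motion-feature vector (e.g., a short trajectory descriptor), and that keyframe seeding has already produced (index, group) seed pairs; all names are hypothetical.

```python
# Hypothetical sketch of feature-space region growing over Gaussian centers.
# Assumed inputs (not from the paper): canonical centers, per-Gaussian motion
# features, and seed (index, group_id) pairs produced by keyframe seeding.
import numpy as np
from scipy.spatial import cKDTree

def grow_motion_groups(centers, motion_feats, seeds, radius=0.05, feat_thresh=0.5):
    """Expand each seed outward through spatial neighbors, absorbing
    Gaussians whose motion features match the growing group."""
    tree = cKDTree(centers)
    labels = np.full(len(centers), -1, dtype=int)  # -1 = unassigned
    for idx, gid in seeds:
        labels[idx] = gid
    frontier = list(seeds)
    while frontier:
        idx, gid = frontier.pop()
        for nbr in tree.query_ball_point(centers[idx], r=radius):
            if labels[nbr] != -1:
                continue
            # accept a spatial neighbor only if it moves like the current one
            if np.linalg.norm(motion_feats[nbr] - motion_feats[idx]) < feat_thresh:
                labels[nbr] = gid
                frontier.append((nbr, gid))
    return labels
```

Gaussians left unassigned after growing could be re-seeded at the next keyframe, which is one way the alternation described above would handle occlusion.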
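The two group-wise constraints can likewise be written as simple losses. Below is a minimal PyTorch sketch assuming per-frame Gaussian centers; the actual parameterization and loss weighting in MoGaF may differ.

```python
# Hypothetical sketches of the two group-wise constraints, in PyTorch.
import torch

def rigid_anchor_loss(x0, xt, R, t):
    """Penalize deviation of a rigid group from one shared SE(3) motion.
    x0: (N, 3) canonical centers, xt: (N, 3) centers at time t,
    R: (3, 3) shared rotation, t: (3,) shared translation."""
    return ((xt - (x0 @ R.T + t)) ** 2).sum(dim=-1).mean()

def local_smoothness_loss(x0, xt, k=8):
    """Encourage neighbors in a non-rigid group to move alike: each
    Gaussian's displacement should match its k nearest canonical neighbors."""
    disp = xt - x0                                         # (N, 3) per-Gaussian motion
    dists = torch.cdist(x0, x0)                            # (N, N) canonical distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self-match
    return ((disp.unsqueeze(1) - disp[knn]) ** 2).sum(dim=-1).mean()
```

In a full system the shared rotation would be kept on SO(3) (e.g., a quaternion or axis-angle parameterization) and optimized jointly with the rendering loss.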

Qualitative Results

Forecasting on iPhone-DyCheck

HyperNeRF comparison

Forecasting on D-NeRF

Quantitative Results

iPhone-DyCheck

MoGaF main table.
Table 1. Forecasting results on the iPhone dataset. We forecast frames beyond the observed time and evaluate the predicted frames rendered from test camera viewpoints. The leftmost column (Obs. ratio) shows the ratio of training frames given to each model; 80% and 60% correspond to forecasting the remaining 20% and 40% of frames, respectively.

D-NeRF

MoGaF D-NeRF table at 60 percent observation.
Table 2. Forecasting results on the D-NeRF dataset. We use 60% of the frames for training and predict the remaining 40%.
MoGaF D-NeRF table at 80 percent observation.
Table 4 (Supplementary). Forecasting results on the D-NeRF dataset. We use 80% of the frames as input and predict the remaining 20%.

Ablation Studies

We analyze two key components: (1) group-wise optimization for coherent object-level motion, and (2) masking-based forecaster training for robust long-horizon extrapolation.

Group-wise Optimization

Ablation table for group-wise optimization.
Table 3(a). Group-wise optimization ablation on iPhone dataset. Removing group-wise optimization degrades both 2D and 3D tracking metrics.

As reported in the main paper, removing group-wise optimization increases 3D tracking error (EPE: 0.220 to 0.250) and degrades both 2D and 3D tracking quality (e.g., OA: 79.0 to 74.2). This confirms that enforcing group-level motion coherence is critical for stable long-horizon forecasting.
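For reference, the reported tracking metrics follow their usual definitions: EPE is the mean endpoint error between predicted and ground-truth tracks, and OA is occlusion (visibility) prediction accuracy. A minimal sketch, assuming (T, N, 3) track arrays and boolean visibility masks; the exact evaluation protocol is an assumption here.

```python
# Hypothetical sketch of the tracking metrics under their common definitions.
import numpy as np

def epe_3d(pred_tracks, gt_tracks, visible):
    """Mean Euclidean error over visible points; tracks are (T, N, 3),
    visible is a (T, N) boolean mask."""
    err = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)  # (T, N)
    return err[visible].mean()

def occlusion_accuracy(pred_vis, gt_vis):
    """Fraction of (frame, point) pairs whose visibility flag is correct."""
    return (pred_vis == gt_vis).mean()
```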

Masking Strategy Ablation

Ablation table for masking strategy. Masking ablation qualitative figure.
Table 3(b) and supplementary figure. Masking strategy ablation on D-NeRF forecasting. Mask-aware training improves future-frame quality and stabilizes long-term predictions.

The masking strategy improves all three forecasting metrics in the main paper: PSNR rises from 24.68 to 25.87, SSIM from 0.9283 to 0.9357, and LPIPS decreases from 0.0551 to 0.0491. The supplementary qualitative examples further show fewer structural artifacts in unseen future frames with mask-aware training.
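To illustrate what mask-aware training looks like, here is a minimal sketch of one masked motion modeling step. The masking ratio, token layout, and the `forecaster` interface are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical sketch of a masked motion modeling training step.
import torch

def masked_motion_step(forecaster, traj, mask_ratio=0.3):
    """Hide a random subset of trajectory tokens and train the forecaster
    to reconstruct them. traj: (B, T, D) group-level motion tokens."""
    B, T, _ = traj.shape
    mask = torch.rand(B, T, device=traj.device) < mask_ratio  # True = hidden
    inp = traj.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out hidden tokens
    pred = forecaster(inp)                           # (B, T, D) reconstruction
    return ((pred - traj) ** 2)[mask].mean()         # loss on hidden tokens only
```

The intuition is that learning to fill in missing spans of motion transfers to extending trajectories past the observed context at inference time.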

Discussion & Limitations

Main paper and supplementary results show that MoGaF consistently improves long-horizon forecasting quality by coupling group-aware optimization with future trajectory prediction. Quantitatively, the gains appear across both the iPhone-DyCheck and D-NeRF benchmarks and are supported by the ablation trends: removing group-wise optimization harms tracking quality, while removing masking degrades future-frame fidelity. Qualitatively, MoGaF better preserves object-level motion coherence in dynamic scenes with complex non-rigid behavior; the difference is most visible in longer rollouts, where local Gaussian drift accumulates for weaker baselines.

As the supplementary qualitative results show, MoGaF still has limitations under challenging long-horizon extrapolation: when future motion is highly ambiguous, partially occluded, or rapidly non-rigid, prediction uncertainty accumulates and can cause temporal drift, local over-smoothing, and occasional structural artifacts in fine object regions. These failure patterns become more pronounced as the forecast horizon extends far beyond the observed context, indicating that long-term stability under severe motion ambiguity remains an open problem.