Seg "Cup"
Seg "Mitts"
Seg "Cookie"
Seg "Toy"
Seg "Chocolate"
Seg "Broom"
Semantic 4D Gaussians can be used to reconstruct and understand dynamic scenes captured by a monocular camera, handling target information that varies over time better than static-scene representations. However, most recent work focuses on the semantics of static scenes. Applying these methods directly to dynamic scenes is impractical, because they fail to capture the temporal behaviors and features of dynamic objects. To the best of our knowledge, few existing works focus on semantic comprehension of dynamic scenes based on 3DGS. While they demonstrate promising capabilities in simple scenes, they struggle to achieve high-fidelity rendering and accurate semantic features in scenarios where the static background contains significant noise and the dynamic foreground exhibits substantial deformation with intricate textures. This is because the same update strategy is applied to all Gaussians, overlooking the distinctions and interactions between the dynamic and static distributions. The result is artifacts and noise during semantic segmentation, especially at the boundary between the dynamic foreground and the static background. To address these limitations, we propose Dual-Hierarchical Optimization (DHO), which consists of hierarchical Gaussian flow and hierarchical rendering guidance. The former effectively separates static and dynamic rendering and their features. The latter helps mitigate distortion in dynamic foreground rendering in scenes where the static background has complex noise (e.g., the “broom” scene in the HyperNeRF dataset). Extensive experiments show that our method consistently outperforms baselines on both synthetic and real-world datasets.
The overall pipeline of our model. We add semantic properties to each Gaussian and obtain the geometric deformation of each Gaussian at timestamp t through the deformation field. In the coarse stage, Gaussians are subjected to geometric constraints; in the fine stage, the geometric constraints are relaxed and semantic feature constraints are introduced. We use dynamic foreground masks obtained from scene priors for hierarchical rendering guidance, which improves the rendering quality of the dynamic foreground in scenes with complex backgrounds.
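The coarse-to-fine schedule and the mask-based guidance described in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stage_weights` and `mask_weighted_l1`, the iteration threshold, the relaxed geometric weight, and the foreground weight are all hypothetical choices made only to show the idea of switching constraints between stages and up-weighting dynamic-foreground pixels.

```python
import numpy as np

def stage_weights(iteration, coarse_iters=3000, relaxed_geo=0.1):
    """Return illustrative loss weights (assumed values, not from the paper).

    Coarse stage: geometric constraint only.
    Fine stage: geometric constraint relaxed, semantic constraint enabled.
    """
    if iteration < coarse_iters:
        return {"geometric": 1.0, "semantic": 0.0}
    return {"geometric": relaxed_geo, "semantic": 1.0}

def mask_weighted_l1(rendered, target, fg_mask, fg_weight=2.0):
    """L1 photometric error with dynamic-foreground pixels up-weighted.

    rendered, target: (H, W, 3) float arrays.
    fg_mask: (H, W) boolean array, True on the dynamic foreground.
    fg_weight: hypothetical emphasis factor for foreground pixels.
    """
    per_pixel = np.abs(rendered - target).mean(axis=-1)  # (H, W)
    weights = np.where(fg_mask, fg_weight, 1.0)          # (H, W)
    # Weighted mean, so the result stays on the scale of a per-pixel error.
    return float((weights * per_pixel).sum() / weights.sum())
```

In a training loop, one would look up `stage_weights(iteration)` at each step and combine the geometric and semantic terms with those weights, while `mask_weighted_l1` would stand in for the photometric term. How the actual method combines its losses is not specified here.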
The following results show rendered novel views and the semantic feature maps extracted by our method, evaluated on both the real-world HyperNeRF dataset and the synthetic D-NeRF dataset. The feature maps are visualized using PCA for dimensionality reduction.
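As a rough illustration of the PCA visualization mentioned above, the sketch below projects a high-dimensional feature map onto its top three principal components and rescales them to RGB. The function name `pca_feature_map_to_rgb` and the per-channel min-max normalization are assumptions; the page does not state which PCA implementation or normalization was actually used.

```python
import numpy as np

def pca_feature_map_to_rgb(features):
    """Map an (H, W, C) feature map to an (H, W, 3) image in [0, 1].

    Assumes C >= 3 and H * W >= 3 so that three components exist.
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c).astype(np.float64)  # copy, (H*W, C)
    flat -= flat.mean(axis=0, keepdims=True)           # center each channel

    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T                             # (H*W, 3)

    # Per-channel min-max scaling for display (an assumed convention).
    lo = proj.min(axis=0)
    hi = proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(h, w, 3)
```

For consistent colors across frames of a video, one would typically fit the principal directions once on a reference frame and reuse them, rather than recomputing the SVD per frame.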
Split-Cookie | ChickChicken | Americano | Torchocolate |
Jumpingjacks | Standup | Trex | Hook |
Our method achieves excellent semantic segmentation performance not only on real-world datasets but also on synthetic datasets.
Seg "Jacket"
Seg "Helmet"
Seg "Skull"
Seg "Shovels"
Seg "Hands"
Our method outperforms the baseline in rendering quality, semantic feature completeness, and semantic segmentation accuracy. (Left: our method; right: baseline.)
Multi-Scale "Chickchicken" | Multi-Scale "Broom" |
Remove "Cookie" | Remove "Lemon" |