arxiv:2512.17012

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Published on Dec 18 · Submitted by Min-Hung Chen on Dec 22 · NVIDIA

Abstract

4D-RGPT, a specialized multimodal LLM, enhances 4D perception in video inputs through Perceptual 4D Distillation and is evaluated on R4D-Bench, a new benchmark for depth-aware dynamic scenes.

AI-generated summary

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Community


Project page: https://www.ca-joe-yang.com/resource/projects/4D_RGPT

  • We propose 4D-RGPT, a specialized MLLM that perceives 4D information for enhanced video understanding.
  • We propose the Perceptual 4D Distillation (P4D) training framework to distill 4D perceptual knowledge into 4D-RGPT without introducing additional inference cost.
  • We introduce R4D-Bench, a region-based 4D VQA benchmark that requires region-level 4D understanding.
  • Our 4D-RGPT improves over the baseline on both non-region-based 3D/4D benchmarks (+5.3% on average across 6 benchmarks) and our region-based R4D-Bench benchmark (+4.3%), while effectively capturing explicit 4D signals.
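
To make the "no additional inference cost" point concrete, here is a minimal toy sketch of feature distillation from a frozen teacher. It is not the paper's P4D implementation: this page does not specify the loss, the expert model, or the features being matched. The names `teacher_features`, `Student`, `distill_loss`, and `train_step` are hypothetical, the "expert" is a fixed elementwise map, and a mean-squared error stands in for whatever objective the authors use.

```python
import random


def teacher_features(x):
    """Hypothetical frozen expert: a fixed map with no trainable parameters."""
    return [2.0 * v + 1.0 for v in x]


class Student:
    """Toy student with one scale and one bias per feature dimension."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = [0.0] * dim

    def features(self, x):
        # Inference path: uses only the student's own parameters.
        return [w * v + b for w, v, b in zip(self.w, x, self.b)]


def distill_loss(student_feats, teacher_feats):
    """Mean-squared error between student and teacher features (an assumption)."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n


def train_step(student, x, lr=0.5):
    """One SGD step on the distillation loss; only the student is updated."""
    target = teacher_features(x)      # teacher is queried, never modified
    pred = student.features(x)
    loss = distill_loss(pred, target)
    n = len(x)
    for i in range(n):
        grad = 2.0 * (pred[i] - target[i]) / n   # d(loss)/d(pred[i])
        student.w[i] -= lr * grad * x[i]
        student.b[i] -= lr * grad
    return loss


if __name__ == "__main__":
    random.seed(0)
    student = Student(dim=4)
    for _ in range(500):
        sample = [random.uniform(-1.0, 1.0) for _ in range(4)]
        train_step(student, sample)
    # After training, the teacher is no longer needed: inference calls
    # student.features alone, so its cost matches an undistilled student.
    print(student.features([0.5, -0.5, 0.0, 1.0]))
```

In this sketch the teacher appears only inside `train_step`, so dropping it after training leaves the inference path unchanged. That is the general property the bullet above describes, under the assumptions stated here.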

arXiv lens breakdown of this paper: https://arxivlens.com/PaperView/Details/4d-rgpt-toward-region-level-4d-understanding-via-perceptual-distillation-1806-b2c771e8

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications
