PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai* Yansong Qu* † You Shen Shengchuan Zhang Liujuan Cao

Key Laboratory of Multimedia Trusted Perception and Efficient Computing,
Ministry of Education of China, Xiamen University

* Equal contribution.   Project lead.   Corresponding author.

PAR3D teaser figure
PAR3D enables part-aware understanding across question answering, segmentation, and reasoning, going beyond object-level 3D-MLLMs.

TL;DR

PAR3D is a unified part-aware 3D-MLLM that understands, reasons about, and grounds both objects and their fine-grained parts in 3D scenes.

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments.

In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries.

Extensive experiments show that PAR3D substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

Dataset

ScenePart: Part-Level 3D Scene Dataset

ScenePart composes part-annotated 3D objects into synthesized indoor layouts, producing object- and part-level mask annotations in 3D scenes and multi-task language instructions for training and evaluating part-aware 3D-MLLMs.

800 scenes
21K object masks
44K part masks
273K language annotations
ScenePart data construction pipeline
ScenePart provides object masks, part masks, object-part correspondences, scene context, and language-task annotations.
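
To make the annotation structure concrete, the sketch below shows one way a scene's object masks, part masks, and object-part correspondences could be organized. This is an illustrative layout only: the class names (`ObjectMask`, `PartMask`), the field names, the use of point indices to represent masks, and the assumption that each part mask lies inside its parent object's mask are all hypothetical and are not taken from the released ScenePart format.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PartMask:
    """Hypothetical part-level annotation: a labeled set of scene point indices."""
    part_id: int
    label: str
    point_indices: Set[int]


@dataclass
class ObjectMask:
    """Hypothetical object-level annotation that owns its part annotations.

    Nesting the parts under the object encodes the object-part correspondence.
    """
    object_id: int
    label: str
    point_indices: Set[int]
    parts: List[PartMask] = field(default_factory=list)


def check_part_containment(obj: ObjectMask) -> bool:
    """Return True if every part mask is a subset of its parent object's mask.

    Containment is an assumption of this sketch, not a documented guarantee.
    """
    return all(part.point_indices <= obj.point_indices for part in obj.parts)


# Toy example: a chair covering points 0..99 with two annotated parts.
chair = ObjectMask(
    object_id=0,
    label="chair",
    point_indices=set(range(100)),
    parts=[
        PartMask(part_id=0, label="seat", point_indices=set(range(0, 40))),
        PartMask(part_id=1, label="leg", point_indices=set(range(40, 60))),
    ],
)
is_consistent = check_part_containment(chair)
```

Nesting parts under their parent object keeps the correspondence explicit, so a part-level instruction can always be traced back to the object it belongs to.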

Method

PAR3D Framework

PAR3D supports diverse 3D vision-language tasks over both objects and their parts through scene-level part supervision, part-aware representation learning, and hierarchical segmentation queries.

01 ScenePart Supervision

Object- and part-level masks with object-part correspondences provide dense supervision in complete 3D scenes.

02 Part-Aware Representation Learning

Part-aware contrastive learning and representation-preserving regularization enrich visual features with fine-grained part semantics.
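
This page does not give the exact loss formulation, so the following is only a minimal sketch of what the two ingredients could look like. It assumes a generic supervised InfoNCE-style contrastive loss, where features sharing a part label are treated as positives, and a mean squared penalty that keeps the learned features close to a set of reference features (for example, from a frozen encoder). The function names, the temperature value, and both loss forms are assumptions for illustration, not PAR3D's actual objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def part_contrastive_loss(features, part_ids, temperature=0.1):
    """Supervised InfoNCE-style loss (assumed form).

    Features with the same part id are pulled together; all other features
    in the batch act as negatives. Anchors without any positive are skipped.
    """
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and part_ids[j] == part_ids[i]]
        if not positives:
            continue
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def preservation_penalty(features, reference_features):
    """Mean squared distance to reference features (assumed regularizer)."""
    squared = sum((a - b) ** 2
                  for f, r in zip(features, reference_features)
                  for a, b in zip(f, r))
    return squared / len(features)


# Toy batch: two features per part, already well separated by part.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = part_contrastive_loss(feats, part_ids=[0, 0, 1, 1])
misaligned_loss = part_contrastive_loss(feats, part_ids=[0, 1, 0, 1])
```

In this toy batch the loss is low when the part labels agree with the feature clusters and high when they do not, which is the behavior a part-aware contrastive term is meant to encourage. The penalty term would be added with some weight to limit drift from the original representation.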

03 Hierarchical Segmentation Queries

Granularity-aware grounding tokens distinguish object-level and part-level targets for unified textual response and mask prediction.
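
As a rough illustration of how granularity-aware tokens could drive hierarchical queries, the sketch below scans a generated response for two placeholder tokens and turns them into object-level and part-level segmentation queries, linking each part query to the most recent object query. The token strings `[OBJ_SEG]` and `[PART_SEG]`, the function name, and the "attach to the preceding object" rule are all hypothetical; the actual token vocabulary and linking mechanism in PAR3D are not specified on this page.

```python
import re

# Hypothetical grounding tokens; the real token names are not given here.
OBJ_TOKEN = "[OBJ_SEG]"
PART_TOKEN = "[PART_SEG]"
TOKEN_PATTERN = re.compile(r"\[(OBJ|PART)_SEG\]")


def parse_grounding_tokens(response):
    """Convert grounding tokens in a response into ordered segmentation queries.

    Each query records its granularity and, for parts, the index of the
    object query it is assumed to belong to (None if no object precedes it).
    """
    queries = []
    last_object = None
    for match in TOKEN_PATTERN.finditer(response):
        if match.group(1) == "OBJ":
            queries.append({"granularity": "object", "parent": None})
            last_object = len(queries) - 1
        else:
            queries.append({"granularity": "part", "parent": last_object})
    return queries


response = f"The chair {OBJ_TOKEN} has a backrest {PART_TOKEN} you can lean on."
queries = parse_grounding_tokens(response)
```

In a full model, the hidden state at each token position would typically be passed to a mask decoder; this sketch only covers the bookkeeping that distinguishes object targets from part targets.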

PAR3D framework pipeline
PAR3D is trained in two stages and generates textual responses as well as object or part masks through hierarchical grounding tokens.

Results

Quantitative Results

PAR3D improves both object-level 3D vision-language performance and part-aware scene understanding across referring segmentation, question answering, and dense captioning benchmarks.

Object-Level Benchmarks

Quantitative comparison on object-level benchmarks
PAR3D achieves strong object-level performance across 3D referring segmentation, question answering, and dense captioning benchmarks while operating as a unified generalist 3D-MLLM.

ScenePart Benchmark

Quantitative comparison on the ScenePart benchmark
PAR3D consistently improves object- and part-level referring segmentation and part-aware question answering on ScenePart, covering object, coarse-part, and fine-part granularities.
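
The metrics behind these comparisons are not named on this page. Referring segmentation is commonly scored with intersection over union (IoU) between predicted and ground-truth masks, so the sketch below shows that computation on masks represented as sets of point indices. Treat the metric choice, the function name `mask_iou`, and the empty-mask convention as assumptions rather than the benchmark's official protocol.

```python
def mask_iou(pred, target):
    """Intersection over union of two masks given as collections of point indices.

    Returns 1.0 when both masks are empty, which is a convention chosen
    for this sketch and may differ from a given benchmark's protocol.
    """
    pred, target = set(pred), set(target)
    union = pred | target
    if not union:
        return 1.0
    return len(pred & target) / len(union)


# Toy example: the prediction shares 2 of the 4 points covered by either mask.
example_iou = mask_iou({1, 2, 3}, {2, 3, 4})
```

The same function applies unchanged to object, coarse-part, and fine-part targets, since only the mask contents differ across granularities.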

Qualitative Results

Part-Aware 3D Scene Understanding

PAR3D demonstrates fine-grained object-part grounding and part-aware reasoning in real and synthetic 3D scenes.

Part-Aware Referring Segmentation and Question Answering

Part-aware referring segmentation and question answering examples
PAR3D handles part-aware instructions on real 3D scans. Blue masks indicate object-level predictions, while red masks indicate part-level predictions, highlighting simultaneous object-part grounding.

Additional Visual Question Answering Comparisons

Additional visual question answering comparisons
Additional examples from ScanQA and ScenePart-QA show that PAR3D provides more accurate answers for questions involving object attributes, object parts, spatial relationships, and scene-level context.

Additional Referring Segmentation Comparisons

Additional referring segmentation comparisons
Additional examples from ScanRefer, Multi3DRefer, and ScenePart-Seg show more accurate referring segmentation across object and part targets. Blue denotes the final target mask for both object-level and part-level segmentation.

Citation

If you find PAR3D useful for your research, please consider citing our work.

@article{dai2026par3d,
  title={PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding},
  author={Dai, Shaohui and Qu, Yansong and Shen, You and Zhang, Shengchuan and Zhang, Miaohui and Cao, Liujuan},
  journal={arXiv preprint arXiv:2606.06485},
  year={2026}
}