PAR3D: A Unified 3D-MLLM with Part-Aware Representation

TL;DR

PAR3D is a unified part-aware 3D-MLLM that understands, reasons about, and grounds both objects and their fine-grained parts in 3D scenes.

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments.

In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries.

Extensive experiments show that PAR3D substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

Dataset

ScenePart: Part-Level 3D Scene Dataset

ScenePart composes part-annotated 3D objects into synthesized indoor layouts, producing object- and part-level mask annotations in 3D scenes and multi-task language instructions for training and evaluating part-aware 3D-MLLMs.

800 scenes

21K object masks

44K part masks

273K language annotations

ScenePart data construction pipeline — ScenePart provides object masks, part masks, object-part correspondences, scene context, and language-task annotations.
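One way to picture the annotation types listed above is as a small nested record per scene. The sketch below is illustrative only: the class names, field names, and the point-index mask encoding are assumptions made for this example and are not the released ScenePart format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PartMask:
    part_id: int
    label: str
    point_indices: Set[int]  # indices into the scene point cloud (assumed encoding)


@dataclass
class ObjectMask:
    object_id: int
    label: str
    point_indices: Set[int]
    parts: List[PartMask] = field(default_factory=list)

    def add_part(self, part: PartMask) -> None:
        # Object-part correspondence: a part must lie inside its parent object.
        if not part.point_indices <= self.point_indices:
            raise ValueError(f"part {part.label!r} leaves object {self.label!r}")
        self.parts.append(part)


@dataclass
class SceneRecord:
    scene_id: str
    objects: List[ObjectMask] = field(default_factory=list)
    instructions: List[Dict[str, str]] = field(default_factory=list)

    def find_part(self, object_id: int, part_label: str) -> Optional[PartMask]:
        # Look up a part through its parent object, mirroring the hierarchy.
        for obj in self.objects:
            if obj.object_id == object_id:
                for part in obj.parts:
                    if part.label == part_label:
                        return part
        return None


# Toy example with made-up indices, labels, and instruction text.
cabinet = ObjectMask(object_id=0, label="cabinet", point_indices=set(range(100)))
cabinet.add_part(PartMask(part_id=0, label="door", point_indices=set(range(0, 40))))
cabinet.add_part(PartMask(part_id=1, label="handle", point_indices=set(range(40, 45))))

scene = SceneRecord(scene_id="toy_scene", objects=[cabinet])
scene.instructions.append(
    {"task": "part_referring_segmentation", "text": "Segment the handle of the cabinet."}
)
```

Storing parts under their parent object keeps the object-part correspondence explicit, so a part-level instruction can always be resolved back to the object it belongs to.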

Method

PAR3D Framework

PAR3D supports diverse 3D vision-language tasks over both objects and their parts through scene-level part supervision, part-aware representation learning, and hierarchical segmentation queries.

01. ScenePart Supervision

Object- and part-level masks with object-part correspondences provide dense supervision in complete 3D scenes.

02. Part-Aware Representation Learning

Part-aware contrastive learning and representation-preserving regularization enrich visual features with fine-grained part semantics.
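This page does not give the exact loss, so the following is only a generic sketch of how a part-level contrastive term and a representation-preserving penalty could be combined. It assumes a supervised InfoNCE-style loss in which features sharing a part ID are positives, and a mean squared distance to a set of reference features as the regularizer. The function names, the `temperature` and `weight` parameters, and both formulations are assumptions, not the authors' implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def part_contrastive_loss(features, part_ids, temperature=0.1):
    """Supervised InfoNCE-style loss: same part ID means positive pair (assumed form)."""
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and part_ids[j] == part_ids[i]]
        if not positives:
            continue  # anchors without a positive contribute nothing
        logits = {j: dot(feats[i], feats[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all other samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def preservation_penalty(features, reference_features):
    """Mean squared distance to reference features (assumed regularizer)."""
    squared = sum(
        (x - y) ** 2
        for f, r in zip(features, reference_features)
        for x, y in zip(f, r)
    )
    return squared / len(features)


def total_loss(features, reference_features, part_ids, weight=1.0, temperature=0.1):
    contrastive = part_contrastive_loss(features, part_ids, temperature)
    return contrastive + weight * preservation_penalty(features, reference_features)
```

In this sketch the contrastive term pulls together features from the same part, while the penalty discourages the adapted features from drifting away from the reference ones; `weight` trades off the two.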

03. Hierarchical Segmentation Queries

Granularity-aware grounding tokens distinguish object-level and part-level targets for unified textual response and mask prediction.

PAR3D framework pipeline — PAR3D is trained in two stages and generates textual responses as well as object or part masks through hierarchical grounding tokens.
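To make the idea of granularity-aware grounding tokens concrete, the sketch below parses a generated response into object-level and part-level segmentation queries. The token strings `[OBJ]` and `[PART]`, the `SegQuery` type, and the rule that a part token attaches to the most recent object token are all hypothetical choices for illustration; the page does not specify the actual token vocabulary or linking rule.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

OBJ_TOKEN = "[OBJ]"    # hypothetical object-level grounding token
PART_TOKEN = "[PART]"  # hypothetical part-level grounding token
TOKEN_PATTERN = re.compile(r"\[OBJ\]|\[PART\]")


@dataclass
class SegQuery:
    level: str             # "object" or "part"
    char_offset: int       # where the token appears in the response
    parent: Optional[int]  # index of the parent object query, if any


def extract_queries(response: str) -> List[SegQuery]:
    """Collect grounding tokens in order and link each part to the latest object."""
    queries: List[SegQuery] = []
    last_object: Optional[int] = None
    for match in TOKEN_PATTERN.finditer(response):
        index = len(queries)
        if match.group() == OBJ_TOKEN:
            queries.append(SegQuery("object", match.start(), None))
            last_object = index
        else:
            queries.append(SegQuery("part", match.start(), last_object))
    return queries


def strip_tokens(response: str) -> str:
    """Return the plain textual answer with grounding tokens removed."""
    return re.sub(r"\s*(\[OBJ\]|\[PART\])", "", response).strip()


response = "The cabinet [OBJ] has a handle [PART] on its left door."
queries = extract_queries(response)
answer = strip_tokens(response)
```

Under these assumptions, each query would then be handed to a mask decoder, with part queries conditioned on their parent object, while `answer` serves as the textual response.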

Results

Quantitative Results

PAR3D improves both object-level 3D vision-language performance and part-aware scene understanding across referring segmentation, question answering, and dense captioning benchmarks.

Quantitative comparison on object-level benchmarks — PAR3D achieves strong object-level performance across 3D referring segmentation, question answering, and dense captioning benchmarks while operating as a unified generalist 3D-MLLM.

Quantitative comparison on the ScenePart benchmark — PAR3D consistently improves object- and part-level referring segmentation and part-aware question answering on ScenePart, covering object, coarse-part, and fine-part granularities.

Qualitative Results

Part-Aware 3D Scene Understanding

PAR3D demonstrates fine-grained object-part grounding and part-aware reasoning in real and synthetic 3D scenes.

Part-aware referring segmentation and question answering examples — PAR3D handles part-aware instructions on real 3D scans. Blue masks indicate object-level predictions, while red masks indicate part-level predictions, highlighting simultaneous object-part grounding.

Additional Visual Question Answering Comparisons

Additional examples from ScanQA and ScenePart-QA show that PAR3D provides more accurate answers for questions involving object attributes, object parts, spatial relationships, and scene-level context.

Additional Referring Segmentation Comparisons

Additional examples from ScanRefer, Multi3DRefer, and ScenePart-Seg show more accurate referring segmentation across object and part targets. Blue denotes the final target mask for both object-level and part-level segmentation.

Citation

If you find PAR3D useful for your research, please consider citing our work.

@article{dai2026par3d,
  title={PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding},
  author={Dai, Shaohui and Qu, Yansong and Shen, You and Zhang, Shengchuan and Zhang, Miaohui and Cao, Liujuan},
  journal={arXiv preprint arXiv:2606.06485},
  year={2026}
}
