Siyuan Huang - Research Scientist, BIGAI

I am a Research Scientist at Beijing Institute for General Artificial Intelligence (BIGAI), where I direct the Center of Embodied AI and Robotics. I teach at Peking University. I received my Ph.D. from Department of Statistics at University of California, Los Angeles (UCLA). During my Ph.D., I have interned at DeepMind and Facebook Reality Lab. Before UCLA, I graduated from Tsinghua University with a Bachelors in Department of Automation.

My research interests lie in computer vision, machine learning, cognition, and robotics.
I currently focus on developing generalizable models for general-purpose robots, especially the language-grounded models for solving the perception, interaction, learning, and planning problem.

Perception: 3D Scene Parsing & Reconstruction & Synthsis; Human Activity Understanding
Interaction: 3D Human-Scene Interaction; Hand-object Interaction; Part-based Interaction
Learning: Neural-symbolic Learning, Structure Learning, Self-supervised Learning, Concept Learning
Planning: Language-grounded Task Learning and Planning, Generalized Planning in Real Scenes

I would like to develop tools to help machines learn 3D representations, percept 3D world, and interact with 3D environments from images or videos. My long-term goal is to build a general-purpose intelligent machine that could understand and interact with the 3D environment like humans.

Prospective students: I co-advise Ph.D. students with researchers at PKU, SJTU, ZJU, USTC, etc through TongProgram. I am looking for self-motivated, independent, creative, and ambitious Ph.D. students for conducting cutting-edge research. Send me CV and research statement if interested.

Hiring: we are hiring research scientists, engineers, and long-term student interns to join our top-tier research team in 3D scene understanding and embodied AI. E-Mail / Google Scholar / Twitter / Thesis

News

10/2025 NEW SceneWeaver won the Best Paper Award at IROS25 RoboGen Workshop
10/2025 NEW We won the first place at the IROS25 Robot Dance Competition, check the full video here!
10/2025 NEW Selected into the New Generation Star Project at IROS25.
10/2025 NEW UniFP won the Best Paper Award at CoRL25, Seoul
10/2025 NEW Invited keynote talk at IROS25 RoboGen: 3D World Generation for Robot Learning and Autonomous Systems workshop!
10/2025 NEW Invited keynote talk at IROS25 2nd AI Meets Autonomy: Vision, Language, and Autonomous Systems Workshop workshop!
09/2025 NEW Invited keynote talk at CoRL25 Beyond Rigid Worlds: Representing and Interacting with Non-Rigid Objects Workshop workshop!
09/2025 NEW We won the champion in the Solo Dance contest at the first World Humanoid Robot Games, check out the full video here!
06/2025 NEW Invited talk at Amazon, Frontier AI & Robotics (FAR).
06/2025 NEW Invited keynote talk at CVPR25 Agents in Interaction, from Humans to Robots workshop!
06/2025 NEW Invited keynote talk at CVPR25 3D-LLM/VLA workshop! Check out the slides here!
03/2025 NEW We are hosting two challenges at CVPR25, including Multimodal Understanding and Reasoning Challenge at 3D scene understanding workshop, and ARNOLD challenge at Embodied AI workshop!
03/2025 NEW Nine papers are accepted by CVPR25 and three papers are accepted by ICRA25, details coming soon!
03/2025 NEW We are organizing the 5th CVPR workshop: 3D Scene Understanding for Vision, Graphics and Robotics!
07/2024 Invited keynote talk at IROS24 AI Meets Autonomy: Vision, Language, and Autonomous Systems workshop! Check the recording here.
07/2024 Four papers are accpeted by ECCV 2024, including SceneVerse, SlotLifter, F-HOI, and PQ3D!
07/2024 I will direct the Joint Lab of Embodied AI and Humanoid Robot between BIGAI and UniTree!
06/2024 Host the 1st workshop on New Trends in Multimodal Human Action Perception, Understanding and Generation at CVPR24!
03/2024 Four papers are accpeted by CVPR 2024, including three highlight papers! Three of them are about human motion modeling and skill learning, one is about 3D scene synthesis for embodied AI!
02/2024 Invited talk at DeepMind Robotics to introduce LEO.
06/2023 Three papers are accepeted by ICCV 2023!
06/2023 We are organizing the 4th CVPR workshop: 3D Scene Understanding for Vision, Graphics and Robotics, welcome to join us!
01/2023 Two papers including SceneDiffuser and GAPartNet are accepted by CVPR 2023! SceneDiffuser code are released!
01/2023 Three papers are accepted by ICLR 2023 with one spotlight!
01/2023 GenDexGrasp is accepted by ICRA 2023. It proposes a generalizable dexterous grasping model for robotics!
10/2021 Invited talk about Compositional Strucutres in Vision and Language at StruCo3D2021 workshop.
06/2021 I am co-organizing the 3rd CVPR 2021 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.
06/2021 Defend my Ph.D. dissertation Human-like Holistic 3D Scene Understanding.
07/2020 Our paper won the Best Paper Award at ICML2020 Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
07/2020 Two papers accepted by ECCV 2020, one of them as oral presentation.
05/2020 Awarded the UCLA Dissertation Year Fellowship.
03/2020 I am co-organizing the 2nd CVPR 2020 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.

Teaching

Computer Vision (PKU, Fall 2022), from a modern view, co-teach with Yixin Zhu
Early and Mid-level Computer Vision (PKU, Fall 2022, Fall 2023), from a statistical and Marr's view, co-teach with Song-Chun Zhu

Thesis

Human-like Holistic 3D Scene Understanding
Siyuan Huang
UCLA, 2021
Ph.D. Thesis

Publications

Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation
Yuyang Li*, Wenxin Du*, Chang Yu*, Puhao Li, Zihang Zhao, Tengyu Liu, Chenfanfu Jiang^†, Yixin Zhu^†, Siyuan Huang^†
NeurIPS 2025
Spotlight Presentation
Paper / Project / Code
Taccel is a high-performance GPU-based simulator, combining ABD and IPC, for simulating robots with vision-based tactile sensors.

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Yandan Yang*, Baoxiong Jia*, Shujie Zhang, Siyuan Huang
NeurIPS 2025
Best Paper Award at IROS25 RoboGen Workshop
Paper / Project / Code
We present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement.

CLONE: Holistic Closed-Loop Humanoid Whole-Body Teleoperation for Long-Horizon Tasks
Yixuan Li*, Yutang Lin*, Jieming Cui, Tengyu Liu, Wei Liang^†, Yixin Zhu^†, Siyuan Huang^†
CoRL 2025
Paper / Project / Code
CLONE is a whole-body teleoperation system that achieves comprehensive robot control using a VR headset. It enables previously unattainable comprehensive skills, such as picking up an object from the ground and placing it in a distant bin, facilitating the collection of long-horizon interaction data.

Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation
Peiyuan Zhi*, Peiyang Li*, Jianqin Yin, Baoxiong Jia^†, Siyuan Huang^†
CoRL 2025
Best Paper Award
Paper / Project / Code
We propose a unified control policy for legged robots that jointly models force and position control learned without force sensors. The policy enables a wide range of manipulation behaviors under varying combinations of force and position inputs, including position tracking, force application, force tracking, and compliant robot behaviors.

ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu^†, Siyuan Huang^†
CoRL 2025
Paper / Project / Code
ControlVLA bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning.

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
Hongyu Shen*, Junfeng Ni, Yixin Chen ^†, Weishuo Li, Mingtao Pei, Siyuan Huang^†
ICCV 2025
Paper / Project / Code
We introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views.

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng,
Siyuan Huang^†, Qing Li^†
ICCV 2025
Highlight
Paper / Project / Code
We introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment.

Towards Scalable Gaussian World Models for Robotic Manipulation
Guanxing Lu*, Baoxiong Jia*, Puhao Li*, Yixin Chen, Ziwei Wang, Yansong Tang, Siyuan Huang
ICCV 2025
Paper / Project / Code
Gaussian World Model (GWM) is a novel branch of world model that predicts dynamic future states and enables robotic manipulation based on the 3D Gaussian Splatting representation.

TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Ruijie Lu, Yixin Chen^†, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang^†
ICCV 2025
Paper / Project / Code
We introduce TACO, which repurposes pre-trained video diffusion models for Video Amodal Completion (VAC), facilitating downstream tasks like reconstruction.

Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation
Ziyin Xiong*, Yinghan Chen*, Puhao Li, Yixin Zhu, Tengyu Liu^†, Siyuan Huang^†
IROS 2025
Paper / Project / Code
Ag2x2 is a computational framework for bimanual manipulation through coordination-aware visual representations that jointly encode object states and hand motion patterns while maintaining agent-agnosticism.

RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
Haoran Geng*, Feishi Wang*, Songlin Wei*, Yuyang Li*, Bangjun Wang*, Boshi An*, Charlie Tianyue Cheng*, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gong, Yuran Wang, Yuxuan Kuang, Ruihai Wu, Baoxiong Jia, Carlo Sferrazza, Hao Dong, Siyuan Huang^†, Yue Wang^†, Jitendra Malik^†, Pieter Abbeel^†
RSS 2025
Oral Presentation
Best Open-source Award at IROS25 RoboGen Workshop
Paper / Project / Code
We introduce RoboVerse, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks for scalable and generalizable robot learning.

ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
Kailin Li, Puhao Li, Tengyu Liu, Yuyang Li, Siyuan Huang
CVPR 2025
Paper / Project / Code
We introduce ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. Leveraging ManipTrans, we transfer multiple hand-object datasets to robotic hands, creating DexManipNet, a large-scale dataset featuring previously unexplored tasks like pen capping and bottle unscrewing.

MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
Huangyue Yu*, Baoxiong Jia*, Yixin Chen*, Yandan Yang, Rongpeng Su, Jiaxin Li, Qing Li, Wei Liang, Song-Chun Zhu, Tengyu Liu,
Siyuan Huang
CVPR 2025
Paper / Project / Code
We present METASCENES, a large-scale 3D scene dataset constructed from real-world scans. It features 706 scenes with 15,366 objects across a wide array of types, arranged in realistic layouts, with visually accurate appearances and physical plausibility.

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
Jieming Cui*, Tengyu Liu*, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu^†, Siyuan Huang^†
CVPR 2025
Oral Presentation
Paper / Project / Code
GROVE generates open-vocabulary and physical-plausible motions through generalized reward. We introduce a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations.

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
Jiangyong Huang*, Baoxiong Jia*, Ziyu Zhu, Yan Wang, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang
CVPR 2025
Paper / Project / Code
We propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding.

Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
Yan Wang*, Baoxiong Jia*, Ziyu Zhu, Siyuan Huang
CVPR 2025
Paper / Project
We propose MPEC (Masked Point-Entity Contrast) for open-vocabulary 3D scene understanding. MPEC achieves state-of-the-art on open-vocabulary 3D semantic segmentation and is more robust to tail classes, visual ambiguity and detailed descriptions.

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen^†, Siyuan Huang^†
CVPR 2025
Paper / Project / Code
DP-Recon incorporates diffusion priors for decompositional neural scene reconstruction to enhance reconstruction quality in sparsely captured and heavily occluded regions.

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu*, Yixin Chen*^†, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, Siyuan Huang^†
CVPR 2025
Paper / Project / Code
MOVIS is able to synthesize novel views of indoor scenes with multiple objects. It can also match a significantly greater number of points, closely aligned with the ground truth.

Dynamic Motion Blending for Versatile Motion Editing
Nan Jiang*, Hongjie Li*, Ziye Yuan*, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu^†, Siyuan Huang^†
CVPR 2025
Paper / Project / Code
MotionReFit enables seamless spatial and temporal motion edits through textual instructions. Powered by our MotionCutMix training strategy, MotionReFit leverages abundant unannotated motion data to augment scarce editing triplets.

InteractAnything: Zero-shot Human Object-Interaction Synthesis via LLM Feedback and Object Affordance Parsing
Jinlu Zhang, Yixin Chen^†, Zan Wang, Jie Yang, Yizhou Wang^†, Siyuan Huang^†
CVPR 2025
Highlight
Paper / Project / Code
Our method enables the generation of diverse, detailed, and novel interactions for open-set 3D objects. Given a simple text description with goal interaction and any object mesh as input, we can synthesize different natural HOI results without training on 3D assets.

Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
Peiyuan Zhi*, Zhiyuan Zhang*, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia^†, Siyuan Huang^†
ICRA 2025
Paper / Project
We present COME-robot, the first closed-loop framework utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios.

PhysPart: Physically Plausible Part Completion for Interactable Objects
Rundong Luo*, Haoran Geng*, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guibas, Siyuan Huang
ICRA 2025
Paper
We propose a diffusion-based part generation model that utilizes geometric conditioning through classifier-free guidance and formulates physical constraints as a set of stability and mobility losses to guide the sampling process.

SYNERGAI: Perception Alignment for Human-Robot Collaboration
Yixin Chen*, Guoxi Zhang*, Yaowei Zhang, Hongming Xu, Peiyuan Zhi, Qing Li, Siyuan Huang
ICRA 2025
Paper / Project
we introduce SYNERGAI, a unified system designed to achieve both perceptual alignment and human-robot collaboration.

ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
Yu Liu*, Baoxiong Jia*, Ruijie Lu, Song-Chun Zhu, Siyuan Huang
ICLR 2025
Paper / Project
We introduce ArtGS, a novel approach that leverages 3D Gaussians as a flexible and efficient representation to reconstruct articulated objects.

PhyRecon: Physically Plausible Neural Scene Reconstruction
Junfeng Ni*, Yixin Chen*, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Puhao Li, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
NeurIPS 2024
Paper / Project / Code
we introduce PhyRecon, the first approach to leverage both differentiable rendering and differentiable physics simulation to learn implicit surface representations for neural scene reconstruction.

Multi-modal Situated Reasoning in 3D Scenes
Xiongkun Linghu*, Jiangyong Huang*, Xuesong Niu, Xiaojian Ma, Baoxiong Jia^†, Siyuan Huang^†
NeurIPS 2024 (Dataset and Benchmark Track)
Paper / Project / Code
we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes.

Autonomous Character-Scene Interaction Synthesis from Text Instruction
Nan Jiang*, Zimo He*, Zi Wang, Hongjie Li, Yixin Chen, Tengyu Liu, Siyuan Huang ^†, Yixin Zhu^†,
SIGGRAPH Asia 2024
Paper / Project
We propose the first unified model for autonomous character-scene interaction synthesis from text instruction.

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
ECCV 2024
Paper / Project / Code
We introduce the first million-scale 3D vision-language dataset, SceneVerse, to facilitate 3D vision-language research.

SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields
Yu Liu*, Baoxiong Jia*, Yixin Chen, Siyuan Huang
ECCV 2024
Paper / Project
We propose SlotLifter, a novel object-centric radiance model addressing scene reconstruction and decomposition jointly via slot-guided feature lifting.

Unifying 3D Vision-Language Understanding Via Promptable Queries
Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang^†, Qing Li^†
ECCV 2024
Paper / Project / Code
We introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning.

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang*, Xuesong Niu*, Nan Jiang*, Ruimao Zhang^†, Siyuan Huang^†,
ECCV 2024
Paper / Project
We propose a unified model called F-HOI, designed to leverage multimodal instructions and empower the Multi-modal Large Language Model to efficiently handle diverse HOI tasks.

Grasp Multiple Objects with One Hand
Yuyang Li, Bo Liu, Yiran Geng, Puhao Li, Yaodong Yang, Yixin Zhu, Tengyu Liu^†, Siyuan Huang^†
IROS & RA-L 2024
Oral Presentation
Paper / Project / Code
We propose a novel hand-agnostic grasping algorithm for generalizable dexterous grasping.

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Puhao Li*, Tengyu Liu*, Yuyang Li, Muzhi Han, Haoran Geng, Shu Wang, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
IROS 2024
Oral Pitch
Paper / Project / Code
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation) for learning novel manipulation skills.

An Embodied Generalist Agent in 3D World
Jiangyong Huang*, Silong Yong*, Xiaojian Ma*, Xiongkun Linghu*, Puhao Li, Yan Wang,
Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
ICML 2024
Paper / Project / Code
We introduce LEO, an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world.

PHYSCENE: Physically Interactable 3D Scene Synthesis for Embodied AI
Yandan Yang*, Baoxiong Jia*, Peiyuan Zhi, Siyuan Huang
CVPR 2024
Highlight, 2.8%
Paper / Project
We introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents.

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jinze Zhang, Tengyu Liu, Yixin Zhu^† , Wei Liang^†, Siyuan Huang^†
CVPR 2024
Highlight, 2.8%
Paper / Project / Code
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation.

Scaling Up Dynamic Human-Scene Interaction Modeling
Nan Jiang*, Zhiyuan Zhang*, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu^† , Siyuan Huang ^†
CVPR 2024
Highlight, 2.8%
Paper / Project / Live Demo / Code
We introduce the TRUMANS dataset as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. We also present the first model that scales up human-scene interaction modeling and achieves remarkable performance in generation.

AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents
Jiemin Cui*, Tengyu Liu*, Nian Liu*, Yaodong Yang, Yixin Zhu^† , Siyuan Huang^†
CVPR 2024
Paper / Project / Code
We propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. AnySkill is the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.

Neural-Symbolic Recursive Machine for Systematic Generalization
Qing Li, Yixin Zhu, Yitao Liang, Ying Nian Wu, Song-Chun Zhu, Siyuan Huang
ICLR 2024
Paper
We propose Neural-Symbolic Recursive Machine (NSR) for learning compositional rules from limited data and applying them to unseen combinations in various domains. NSR achieves 100% generalization accuracy on SCAN and PCFG and outperforms state-of-the-art models on HINT by about 23%.

Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture
Yixin Chen*, Junfeng Ni*, Nan Jiang, Yaowei Zhang, Yixin Zhu, Siyuan Huang
3DV 2024
Paper / Project / Code
We introduce MultiGrasp, a two-stage method for multi-object grasping on a tabletop with a multi-finger dexterous hand. It involves (i) generating pre-grasp proposals and (ii) executing the grasp and lifting the objects.

ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic Scenes
Ran Gong*, Jiangyong Huang*, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
ICCV 2023
CoRL 2022 Workshop on Language and Robot Learning
Spotlight Presentation
Arxiv / Project / Code
We present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD consists of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.

Full-Body Articulated Human-Object Interaction
Nan Jiang*, Tengyu Liu*, Zhexuan Cao, Jieming Cui, Zhiyuan Zhang, He Wang, Yixin Zhu^† , Siyuan Huang^†
ICCV 2023
Paper / Project / Code
We present CHAIRS, a large-scale motion-captured f-AHOI dataset, consisting of 16.2 hours of versatile interactions between 46 participants and 74 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions.

3D-VISTA: Pre-trained Transformer for 3D Vision and Text Alignment
Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang^†, Qing Li ^†
ICCV 2023
Paper / Project / Code
We propose 3D-VisTA, a foundation model that can be easily adapted to various 3D vision-language tasks.

Diffusion-based Generation, Optimization, and Planning in 3D Scenes
Siyuan Huang*, Zan Wang*, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu
CVPR 2023
Paper / Project / Code
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior work, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented.

GAPartNet: Learning Generalizable and Actionable Parts for Cross-Category Object Perception and Manipulation
Haoran Geng*, Helin Xu*, Chengyang Zhao*, Chao Xu, Li Yi, Siyuan Huang, He Wang
CVPR 2023
Highlight, 2.5%
Paper / Project / Code
We propose to learn generalizable object perception and manipulation skills via Generalizable and Actionable Parts, and present GAPartNet, a large-scale interactive dataset with rich part annotations.

A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics
Qing Li, Siyuan Huang, Yining Hong, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
ICLR 2023
Spotlight, 5.7%
Paper
we present a new dataset, HINT, to study machines' capability of learning generalizable concepts at three different levels: perception, syntax, and semantics.

SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li , Yitao Liang, Song-Chun Zhu, Siyuan Huang
ICLR 2023
Paper / Project / Code / Slides
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.

Improving Object-centric Learning with Query Optimization
Baoxiong Jia*, Yu Liu*, Siyuan Huang
ICLR 2023
Paper / Project / Code
Our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin.

GenDexGrasp: Generalizable Dexterous Grasping
Puhao Li*, Tengyu Liu*, Yuyang Li, Yiran Geng, Yixin Zhu, Yaodong Yang, Siyuan Huang
ICRA 2023
Paper / Project / Code
We propose a novel hand-agnostic grasping algorithm for generalizable dexterous grasping.

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes
Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang
NeurIPS 2022
Paper / Project / Code
We present a novel scene-and-language conditioned generative model that can produce 3D human motions of the desirable action interacting with the specified objects. For example, sit on the armchair near the desk.To fill in the gap. We collect a large-scale and semantic-rich synthetic human-scene interaction dataset, denoted as HUMANISE, to enable such langauge-conditoned scene understanding tasks.

EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Baoxiong Jia , Ting Lei, Song-Chun Zhu, Siyuan Huang
NeurIPS 2022 (Dataset and Benchmark Track)
Paper / Project / Code
we introduce the EgoTaskQA benchmark that8 provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.

PartAfford: Part-level Affordance Discovery from 3D Objects
Chao Xu , Yixin Chen, He Wang, Song-Chun Zhu, Yixin Zhu, Siyuan Huang
ECCV 2022 Visual Object-oriented Learning meets Interaction (VOLI) Workshop
Paper
We present a new task of part-level affordance discovery (PartAfford): Given only the affordance labels per object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category.

Learning V1 simple cells with vector representations of local contents and matrix representations of local motions
Ruiqi Gao , Jianwen Xie , Siyuan Huang, Yufan Ren, Song-Chun Zhu, Ying Nian Wu
AAAI 2022
Paper
we propose a representational model that couples the vector representations of local image contents with the matrix representations of local pixel displacements. When the image changes from one time frame to the next due to pixel displacements, the vector at each pixel is rotated by a matrix that represents the displacement of this pixel.

VLGrammar: Grounded Grammar Induction of Vision and Language
Yining Hong, Qing Li, Song-Chun Zhu, Siyuan Huang
ICCV 2021
Paper / Supplementary / Code
We study grounded grammar induction of vision and language in a joint learning framework.

YouRefIt: Embodied Reference Understanding with Language and Gesture
Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao , Yixin Zhu, Song-Chun Zhu, Siyuan Huang
ICCV 2021
Oral Presentation
Paper / Supplementary / Project / Code
We study the machine's understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment.

Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds
Siyuan Huang*, Yichen Xie* Song-Chun Zhu, Yixin Zhu
ICCV 2021
Paper / Supplementary / Project / Code
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion.

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
Pan Lu*, Ran Gong*, Shibiao Jiang*, Liang Qiu, Siyuan Huang, Xiaodan Liang, Song-Chun Zhu
ACL 2021
Oral Presentation
Paper / Code / Project / Bibtex
we construct a new largescale benchmarkconsisting of 3,002 geometry problems with dense annotation in formal language and propose a novel geometry solving approach with formal language and symbolic reasoning.

Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis
Yaxuan Zhu, Ruiqi Gao, Siyuan Huang, Song-Chun Zhu, Ying Nian Wu
CVPR 2021
Oral Presentation
Paper / Supplementary / Code
To efficiently represent camera pose in 3D computer vision, we propose an approach to learn neural representations of camera poses and 3D scenes, coupled with neural representations of local camera movements.

SMART: A Situation Model for Algebra Story Problems via Attributed Grammar
Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, Song-Chun Zhu
AAAI 2021
Paper / Project / Bibtex
We propose SMART, which adopts attributed grammar as the representation of situation models for solving the algebra story problems.

Learning by Fixing: Solving Math Word Problems with Weak Supervision
Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, Song-Chun Zhu
AAAI 2021
Paper / Supplementary / Code / Project / Bibtex
We introduce a weakly-supervised paradigm for learning math word problems. Our method only requires the annotations of the final answers and can generate various solutions for a single problem.

A Competence-aware Curriculum for Visual Concepts Learning via Question Answering
Qing Li , Siyuan Huang, Yining Hong, Song-Chun Zhu
ECCV 2020
Oral Presentation
Paper
We design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the visual concept learning process with an adaptive curriculum.

LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities
Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-Chun Zhu
ECCV 2020
Paper / Code / Project / Bibtex
We introduce the LEMMA dataset to provide a single home to address these missing dimensions with carefully designed settings, wherein the numbers of tasks and agents vary to highlight different learning objectives. We densely annotate the atomic-actions with human-object interactions to provide ground-truth of the compositionality, scheduling, and assignment of daily activities.

Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning
Qing Li , Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, Song-Chun Zhu
ICML 2020
Best Paper Award at Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
Paper / Supplementary / Code / Project / Bibtex
We close the loop of neural-symbolic learning by introducing the grammar}model as a symbolic prior to bridge neural perception and symbolic reasoning, and proposing a novel back-search algorithm which mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently.

Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense
Yixin Zhu , Tao Gao , Lifeng Fan , Siyuan Huang, Edmonds Mark, Hangxin Liu, Feng Gao, Chi Zhang, Siyuan Qi, Ying Nian Wu, Josh Tenenbaum, Song-Chun Zhu
Engineering 2020
Paper
We demonstrate the power of this perspective to develop cognitive AI systems with humanlike common sense by showing how to observe and apply FPICU with little training data to solve a wide range of challenging tasks, including tool use, planning, utility inference, and social learning.

A Generalized Earley Parser for Human Activity Parsing and Prediction
Siyuan Qi , Baoxiong Jia , Siyuan Huang, Ping Wei, Song-Chun Zhu
TPAMI 2020
Paper
Propose an algorithm to tackle the task of understanding complex human activities from (partially observed) videos from two important aspects: activity recognition and prediction.

PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points
Siyuan Huang, Yixin Chen, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
Neural Information Processing Systems (NeurIPS) 2019
Paper
To solve the problem of 3D object detection, we propose perspective points as a novel intermediate representation, defined as the 2D projections of locally-Manhattan 3D keypoints to locate an object, and they satisfy certain geometric constraints caused by the perspective projection.

Holistic++ Scene Understanding: Single-view 3D Holistic Scene Parsing and Human Pose Estimation with Human-Object Interaction and Physical Commonsense
Yixin Chen *, Siyuan Huang *, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper / Supplementary / Project
Propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction- and (ii) 3D human pose estimation. We incorporate the human-object interaction (HOI) and physical commonsense to tackle this problem.

Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning
Lifeng Fan *, Wenguan Wang *, Siyuan Huang, Xinyu Tang, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper
Propose a new problem of understanding human gaze communication in social videos from both atomic-level and event-level, which is significant for studying human social interactions.

Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Neural Information Processing Systems (NeurIPS) 2018
Paper / Supplementary / Poster / Video / Code / Project
Propose an end-to-end model that simultaneously solves tasks of 3D object detection, 3D layout estimation and camera pose estimation in real-time given only a single RGB image

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, Song-Chun Zhu
European Conference on Computer Vision (ECCV) 2018
Paper / Supplementary / Project / Code / Poster / Bibtex
Propose a computational framework to parse and reconstruct the 3D configuration of an indoor scene from a single RGB image in an analysis-by-synthesis fasion using a stochastic grammar model.

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars

Chenfanfu Jiang *, Siyuan Qi *, Yixin Zhu *, Siyuan Huang *, Jenny Lin, Xingwen Guo, Lap-Fai Yu, Demetri Terzopoulos, Song-Chun Zhu

* Equal contributions
Internatianal Journal of Computer Vision (IJCV) 2018
Paper / Demo
Employ physics-based rendering to synthesize photorealistic RGB images while automatically synthesizing detailed,per-pixel ground truth data, including visible surface depth and normal, object identity and material information, as well as illumination.

Human-centric Indoor Scene Synthesis using Stochastic Grammar
Siyuan Qi, Yixin Zhu , Siyuan Huang, Chenfanfu Jiang , Song-Chun Zhu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
Paper / Project / Code / Bibtex
Present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with the perfect per-pixel ground truth.

Predicting Human Activities Using Stochastic Grammar
Siyuan Qi, Siyuan Huang, Ping Wei, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2017
Paper / Bibtex / Code
Use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances for modeling the rich context between human and environment.

Nonlinear Local Metric Learning for Person Re-identification
Siyuan Huang, Jiwen Lu, Jie Zhou, Anil K. Jain
arXiv 2015
arXiv Paper
Utilize the merits of both local metric learning and deep neural network to exploit the complex nonlinear transformations in the feature space of person re-identification data.

Building Change Detection Based on 3D reconstruction
Baohua Chen, Lei Deng, Yueqi Duan, Siyuan Huang, Jie Zhou
IEEE International Conference on Image Processing (ICIP) 2015
Paper / Bibtex
Propose a change detection framework based on RGB-D map generated by 3D reconstruction which can overcome the large illumination changes .