News
- 06/2023 We are organizing the 4th CVPR 2023 workshop: 3D Scene Understanding for Vision, Graphics and Robotics. Welcome to join us!
- 01/2023 NEW Two papers, SceneDiffuser and GAPartNet, are accepted by CVPR 2023! SceneDiffuser code is released!
- 01/2023 NEW Three papers are accepted by ICLR 2023 with one spotlight!
- 01/2023 NEW GenDexGrasp is accepted by ICRA 2023. It proposes a generalizable dexterous grasping model for robotics!
- 12/2022 ARNOLD is accepted as Spotlight by CoRL 2022 Workshop on Language and Robot Learning!
- 09/2022 Two papers accepted by NeurIPS 2022. HUMANISE is the first work on language-conditioned motion generation in 3D scenes; EgoTaskQA introduces the first QA benchmark for understanding goal-oriented human tasks from egocentric videos.
- 10/2021 Invited talk about Compositional Structures in Vision and Language at the StruCo3D2021 workshop.
- 07/2021 Three papers accepted by ICCV 2021, including one oral presentation about embodied reference understanding.
- 06/2021 I am co-organizing the 3rd CVPR 2021 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.
- 06/2021 Defended my Ph.D. dissertation Human-like Holistic 3D Scene Understanding.
- 05/2021 Two papers about embodied reference understanding and systematic generalization are accepted by ICLR 2021 workshops.
- 02/2021 One paper about neural representation of camera pose is accepted by CVPR 2021 as oral.
- 08/2020 Two papers about neural-symbolic learning and math word problems are accepted by AAAI 2021.
- 08/2020 LEMMA dataset and code released.
- 07/2020 Our paper won the Best Paper Award at the ICML 2020 Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
- 07/2020 Two papers accepted by ECCV 2020, one of them as oral presentation.
- 05/2020 Awarded the UCLA Dissertation Year Fellowship.
- 05/2020 One paper about neural-symbolic learning accepted by ICML 2020. Check the project page for more details, including the code.
- 05/2020 Our CVPR workshop will be fully virtual. Please check the scene understanding workshop page for more details.
- 03/2020 I am co-organizing the 2nd CVPR 2020 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.
Teaching
- Computer Vision (PKU, Fall 2022), from a modern view, co-taught with Yixin Zhu
- Early and Mid-level Computer Vision (PKU, Fall 2022), from a statistical and Marr's view, co-taught with Song-Chun Zhu
Preprints
|
ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic Scenes
Ran Gong*,
Jiangyong Huang*,
Yizhou Zhao,
Haoran Geng,
Xiaofeng Gao,
Qingyang Wu,
Wensi Ai,
Ziheng Zhou,
Demetri Terzopoulos,
Song-Chun Zhu,
Baoxiong Jia,
Siyuan Huang
CoRL 2022 Workshop on Language and Robot Learning (Spotlight Presentation)
arXiv / Project / Code
We present ARNOLD, a
benchmark that evaluates language-grounded task learning with continuous states
in realistic 3D scenes. ARNOLD consists of 8 language-conditioned tasks that
involve understanding object states and learning policies for continuous goals.
|
|
CHAIRS: Towards Full-Body Articulated Human-Object Interaction
Nan Jiang*,
Tengyu Liu*,
Zhexuan Cao,
Jieming Cui,
He Wang,
Yixin Zhu† ,
Siyuan Huang†
arXiv
Paper / Project / Code
We present CHAIRS, a large-scale motion-captured f-AHOI dataset, consisting of 16.2 hours of versatile interactions between 46 participants and 74 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions.
|
|
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Jiangyong Huang*, William Yicheng Zhu*,
Baoxiong Jia,
Zan Wang,
Xiaojian Ma,
Qing Li,
Siyuan Huang
arXiv
Paper
We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four disjoint functional domains: Perceive, Ground, Reason, and Act.
|
|
Neural-Symbolic Recursive Machine for Systematic Generalization
Qing Li,
Yixin Zhu,
Yitao Liang,
Ying Nian Wu,
Song-Chun Zhu,
Siyuan Huang
arXiv
Paper
We propose Neural-Symbolic Recursive Machine (NSR) for learning compositional rules from limited data and applying them to unseen combinations in various domains. NSR achieves 100% generalization accuracy on SCAN and
PCFG and outperforms state-of-the-art models on HINT by about 23%.
|
Publications
|
Diffusion-based Generation, Optimization, and Planning in 3D Scenes
Siyuan Huang*,
Zan Wang*,
Puhao Li,
Baoxiong Jia,
Tengyu Liu,
Yixin Zhu,
Wei Liang,
Song-Chun Zhu
CVPR 2023
Paper / Project / Code
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior work, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented.
|
|
GAPartNet: Learning Generalizable and Actionable Parts for Cross-Category Object Perception and Manipulation
Haoran Geng*, Helin Xu*, Chengyang Zhao*,
Chao Xu,
Li Yi,
Siyuan Huang,
He Wang
CVPR 2023 (Highlight)
Paper / Project / Code
We propose to learn generalizable object perception and manipulation skills via Generalizable and Actionable Parts,
and present GAPartNet, a large-scale interactive dataset with rich part annotations.
|
|
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics
Qing Li,
Siyuan Huang,
Yining Hong,
Yixin Zhu,
Ying Nian Wu,
Song-Chun Zhu
ICLR 2023 (Spotlight Presentation)
Paper
We present a new dataset, HINT, to study machines' capability of learning generalizable concepts at three different levels: perception, syntax, and semantics.
|
|
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma, Silong Yong,
Zilong Zheng,
Qing Li,
Yitao Liang,
Song-Chun Zhu,
Siyuan Huang
ICLR 2023
Paper / Project / Code / Slides
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation.
|
|
Improving Object-centric Learning with Query Optimization
Baoxiong Jia*,
Yu Liu*,
Siyuan Huang
ICLR 2023
Paper / Project / Code
Our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised
image segmentation and reconstruction, outperforming previous baselines by a
large margin.
|
|
HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes
Zan Wang,
Yixin Chen,
Tengyu Liu,
Yixin Zhu,
Wei Liang,
Siyuan Huang
NeurIPS 2022
Paper / Project / Code
We present a novel scene-and-language conditioned generative model that can produce 3D human motions performing the desired action while interacting with the specified objects, e.g., "sit on the armchair near the desk." To fill the gap, we collect a large-scale and semantically rich synthetic human-scene interaction dataset, HUMANISE, to enable such language-conditioned scene understanding tasks.
|
|
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Baoxiong Jia,
Ting Lei,
Song-Chun Zhu,
Siyuan Huang
NeurIPS 2022 (Dataset and Benchmark Track)
Paper / Project / Code
We introduce the EgoTaskQA benchmark, which
provides a single home for the crucial dimensions of task understanding through
question answering on real-world egocentric videos. We meticulously design
questions that target the understanding of (1) action dependencies and effects,
(2) intents and goals, and (3) agents' beliefs about others.
|
|
Learning V1 simple cells with vector representations of local contents and matrix representations of local motions
Ruiqi Gao,
Jianwen Xie,
Siyuan Huang,
Yufan Ren,
Song-Chun Zhu,
Ying Nian Wu
AAAI 2022
Paper
We propose a representational model that couples the vector representations of local image contents with the matrix representations of local pixel displacements. When the image changes from one time frame to the next due to pixel displacements, the vector at each pixel is rotated by a matrix that represents the displacement of this pixel.
|
|
Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds
Siyuan Huang*,
Yichen Xie*,
Song-Chun Zhu,
Yixin Zhu
ICCV 2021
Paper /
Supplementary /
Project /
Code
We introduce a spatio-temporal representation
learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion.
|
|
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
Pan Lu*, Ran Gong*, Shibiao Jiang*, Liang Qiu, Siyuan Huang, Xiaodan Liang, Song-Chun Zhu
ACL 2021 (Oral Presentation)
Paper /
Code /
Project /
Bibtex
We construct a new large-scale benchmark consisting of 3,002 geometry problems with dense annotations in formal language, and propose a novel geometry problem-solving approach with formal language and symbolic reasoning.
|
|
Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis
Yaxuan Zhu,
Ruiqi Gao,
Siyuan Huang,
Song-Chun Zhu,
Ying Nian Wu
CVPR 2021 (Oral Presentation)
Paper /
Supplementary /
Code
To efficiently represent camera pose in 3D computer vision, we propose an approach to learn neural representations of camera poses and 3D scenes, coupled with neural representations of local camera movements.
|
|
A Competence-aware Curriculum for Visual Concepts Learning via Question Answering
Qing Li,
Siyuan Huang,
Yining Hong,
Song-Chun Zhu
ECCV 2020 (Oral Presentation)
Paper
We design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the visual concept learning process with an adaptive curriculum.
|
|
LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities
Baoxiong Jia,
Yixin Chen,
Siyuan Huang,
Yixin Zhu,
Song-Chun Zhu
ECCV 2020
Paper /
Code /
Project /
Bibtex
We introduce the LEMMA dataset to provide a single home to address these missing dimensions with carefully designed settings, wherein the numbers of tasks and agents vary to highlight different learning objectives. We densely annotate the atomic actions with human-object interactions to provide ground truth for the compositionality, scheduling, and assignment of daily activities.
|
|
Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense
Yixin Zhu,
Tao Gao,
Lifeng Fan,
Siyuan Huang,
Mark Edmonds,
Hangxin Liu,
Feng Gao,
Chi Zhang,
Siyuan Qi,
Ying Nian Wu,
Josh Tenenbaum,
Song-Chun Zhu
Engineering 2020
Paper
We demonstrate the power of this perspective to develop cognitive AI systems with humanlike common sense by showing how to
observe and apply FPICU with little training data to solve a wide range of challenging tasks, including tool use, planning, utility
inference, and social learning.
|
|
A Generalized Earley Parser for Human Activity Parsing and Prediction
Siyuan Qi,
Baoxiong Jia,
Siyuan Huang,
Ping Wei,
Song-Chun Zhu
TPAMI 2020
Paper
Propose an algorithm to understand complex human activities from (partially observed) videos from two important aspects: activity recognition and prediction.
|
|
PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points
Siyuan Huang,
Yixin Chen,
Tao Yuan,
Siyuan Qi,
Yixin Zhu,
Song-Chun Zhu
Neural Information Processing Systems (NeurIPS) 2019
Paper
To solve the problem of 3D object detection, we propose perspective points as a novel intermediate representation, defined as the 2D projections of locally Manhattan 3D keypoints for locating an object; these points satisfy geometric constraints imposed by the perspective projection.
|
|
Holistic++ Scene Understanding: Single-view 3D Holistic Scene Parsing and Human Pose Estimation with Human-Object Interaction and Physical Commonsense
Yixin Chen*,
Siyuan Huang*,
Tao Yuan,
Siyuan Qi,
Yixin Zhu,
Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper /
Supplementary /
Project
Propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction, and (ii) 3D human pose estimation. We incorporate human-object interaction (HOI) and physical commonsense to tackle this problem.
|
|
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning
Lifeng Fan*,
Wenguan Wang*,
Siyuan Huang,
Xinyu Tang,
Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper
Propose a new problem of understanding human gaze communication in social videos at both the atomic and event levels, which is significant for studying human social interactions.
|
|
Configurable 3D Scene Synthesis and 2D Image Rendering
with Per-Pixel Ground Truth using Stochastic Grammars
* Equal contributions
International Journal of Computer Vision (IJCV) 2018
Paper /
Demo
Employ physics-based rendering to synthesize photorealistic RGB images while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity and material information, as well as illumination.
|
|
Human-centric Indoor Scene Synthesis using Stochastic Grammar
Siyuan Qi,
Yixin Zhu,
Siyuan Huang,
Chenfanfu Jiang,
Song-Chun Zhu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
Paper /
Project /
Code /
Bibtex
Present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with perfect per-pixel ground truth.
|
|
Predicting Human Activities Using Stochastic Grammar
Siyuan Qi,
Siyuan Huang,
Ping Wei,
Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2017
Paper /
Bibtex /
Code
Use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances to model the rich context between humans and the environment.
|
|
Nonlinear Local Metric Learning for Person Re-identification
Siyuan Huang,
Jiwen Lu,
Jie Zhou,
Anil K. Jain
arXiv 2015
arXiv Paper
Utilize the merits of both local metric learning and deep neural networks to exploit the complex nonlinear transformations in the feature space of person re-identification data.
|
|
Building Change Detection Based on 3D Reconstruction
Baohua Chen,
Lei Deng,
Yueqi Duan,
Siyuan Huang,
Jie Zhou
IEEE International Conference on Image Processing (ICIP) 2015
Paper /
Bibtex
Propose a change detection framework based on an RGB-D map generated by 3D reconstruction, which can overcome large illumination changes.
|
|