Siyuan Huang

I am a Research Scientist and Team Lead at the Beijing Institute for General Artificial Intelligence (BIGAI). I received my Ph.D. from the Department of Statistics at the University of California, Los Angeles (UCLA), advised by Professor Song-Chun Zhu. During my Ph.D., I interned at DeepMind and Facebook Reality Lab. Before UCLA, I graduated from Tsinghua University with a Bachelor's degree from the Department of Automation.


My research interests lie in computer vision, machine learning, cognition, and robotics.
I currently focus on the problem of human-like holistic 3D scene understanding, which encompasses perception, interaction, learning, and reasoning.

  • Perception: Task-oriented 3D Scene Parsing & Reconstruction & Synthesis, Action Understanding
  • Interaction: 4D Human-object Interaction, Human-human Interaction
  • Learning: Neural-symbolic Learning, Structure Learning, Self-supervised Learning, Concept Learning
  • Reasoning: Vision-and-Language Reasoning, Cognitive Reasoning, Systematic Generalization
I would like to develop tools that help machines learn 3D representations, perceive the 3D world, and interact with 3D environments from images or videos. My long-term goal is to build a general-purpose intelligent machine that can understand and interact with the 3D environment as humans do.
E-Mail / CV / Google Scholar / Thesis
News
  • 10/2021 NEW   Invited talk about Compositional Structures in Vision and Language at the StruCo3D 2021 workshop.
  • 07/2021 NEW   Three papers accepted by ICCV 2021, including one oral presentation about embodied reference understanding.
  • 06/2021 NEW   I am co-organizing the 3rd CVPR 2021 workshop: 3D Scene Understanding for Vision, Graphics and Robotics. The workshop will be virtual.
  • 06/2021 NEW   Defended my Ph.D. dissertation, Human-like Holistic 3D Scene Understanding.
  • 05/2021 NEW   Two papers about embodied reference understanding and systematic generalization were accepted by ICLR 2021 workshops.
  • 02/2021 NEW   One paper about neural representation of camera pose was accepted by CVPR 2021 as an oral presentation.
  • 08/2020 NEW   Two papers about neural-symbolic learning and math word problems are accepted by AAAI 2021.
  • 08/2020 LEMMA dataset and code released.
  • 07/2020 Our paper won the Best Paper Award at the ICML 2020 Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
  • 07/2020 Two papers accepted by ECCV 2020, one of them as oral presentation.
  • 05/2020 Awarded the UCLA Dissertation Year Fellowship.
  • 05/2020 One paper about neural-symbolic learning accepted by ICML 2020. Check the project page for more details, including the code.
  • 05/2020 Our CVPR workshop will be totally virtual. Please check the scene understanding workshop for more details.
  • 03/2020 I am co-organizing the 2nd CVPR 2020 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.
  • 02/2020 I will intern at DeepMind, London during the summer.
  • 07/2019 One paper on 3D object detection is accepted by NeurIPS 2019.
  • 05/2020 Two papers accepted by ICCV 2019. Check our papers on holistic++ scene understanding (live demo for 3D human pose estimation) and human gaze communication.
  • 03/2019 I am co-organizing the CVPR 2019 workshop: 3D Scene Understanding for Vision, Graphics and Robotics.



Publications

arXiv PartAfford: Part-level Affordance Discovery from 3D Objects
Chao Xu, Yixin Chen, He Wang, Song-Chun Zhu, Yixin Zhu, Siyuan Huang
arXiv
Paper
We present a new task of part-level affordance discovery (PartAfford): Given only the affordance labels per object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category.


AAAI 2022 Learning V1 simple cells with vector representations of local contents and matrix representations of local motions
Ruiqi Gao, Jianwen Xie, Siyuan Huang, Yufan Ren, Song-Chun Zhu, Ying Nian Wu
AAAI 2022
Paper
We propose a representational model that couples the vector representations of local image contents with the matrix representations of local pixel displacements. When the image changes from one time frame to the next due to pixel displacements, the vector at each pixel is rotated by a matrix that represents the displacement of this pixel.


ICCV 2021 VLGrammar: Grounded Grammar Induction of Vision and Language
Yining Hong, Qing Li, Song-Chun Zhu, Siyuan Huang
ICCV 2021
Paper / Supplementary / Code
We study grounded grammar induction of vision and language in a joint learning framework.


ICCV 2021 YouRefIt: Embodied Reference Understanding with Language and Gesture
Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu, Siyuan Huang
ICCV 2021 (Oral)
ICLR 2021 Embodied Multimodal Learning Workshop (Short Version)
Paper / Supplementary / Project / Code
We study the machine's understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment.


ICCV 2021 Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds
Siyuan Huang*, Yichen Xie*, Song-Chun Zhu, Yixin Zhu
ICCV 2021
Paper / Supplementary / Project / Code
We introduce a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion.

ICLR 2021 A HINT from Arithmetic: On Systematic Generalization of Perception, Syntax, and Semantics
Qing Li, Siyuan Huang, Yining Hong, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
ICLR 2021 The Role of Mathematical Reasoning in General Artificial Intelligence Workshop (Short Version)
Paper
We present a new dataset, HINT, to study machines' capability of learning generalizable concepts at three different levels: perception, syntax, and semantics.


ACL 2021 Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
Pan Lu*, Ran Gong*, Shibiao Jiang*, Liang Qiu, Siyuan Huang, Xiaodan Liang, Song-Chun Zhu
ACL 2021 (Oral)
Paper / Code / Project / Bibtex
We construct a new large-scale benchmark consisting of 3,002 geometry problems with dense annotation in formal language and propose a novel geometry-solving approach with formal language and symbolic reasoning.


CVPR 2021 Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis
Yaxuan Zhu, Ruiqi Gao, Siyuan Huang, Song-Chun Zhu, Ying Nian Wu
CVPR 2021 (Oral)
Paper / Supplementary / Code
To efficiently represent camera pose in 3D computer vision, we propose an approach to learn neural representations of camera poses and 3D scenes, coupled with neural representations of local camera movements.


AAAI 2021 SMART: A Situation Model for Algebra Story Problems via Attributed Grammar
Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, Song-Chun Zhu
AAAI 2021
Paper / Project / Bibtex
We propose SMART, which adopts attributed grammar as the representation of situation models for solving algebra story problems.


AAAI 2021 Learning by Fixing: Solving Math Word Problems with Weak Supervision
Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, Song-Chun Zhu
AAAI 2021
Paper / Supplementary / Code / Project / Bibtex
We introduce a weakly-supervised paradigm for learning math word problems. Our method only requires the annotations of the final answers and can generate various solutions for a single problem.


ECCV 2020 A Competence-aware Curriculum for Visual Concepts Learning via Question Answering
Qing Li, Siyuan Huang, Yining Hong, Song-Chun Zhu
ECCV 2020 (Oral)
Paper
We design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the visual concept learning process with an adaptive curriculum.


ECCV 2020 LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities
Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-Chun Zhu
ECCV 2020
Paper / Code / Project / Bibtex
We introduce the LEMMA dataset to provide a single home to address these missing dimensions with carefully designed settings, wherein the numbers of tasks and agents vary to highlight different learning objectives. We densely annotate the atomic actions with human-object interactions to provide ground truth for the compositionality, scheduling, and assignment of daily activities.


ICML 2020 Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning
Qing Li, Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, Song-Chun Zhu
ICML 2020
Best Paper Award in Workshop on Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond.
Paper / Supplementary / Code / Project / Bibtex
We close the loop of neural-symbolic learning by introducing the grammar model as a symbolic prior to bridge neural perception and symbolic reasoning, and by proposing a novel back-search algorithm that mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently.


Engineering 2020 Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense
Yixin Zhu, Tao Gao, Lifeng Fan, Siyuan Huang, Mark Edmonds, Hangxin Liu, Feng Gao, Chi Zhang, Siyuan Qi, Ying Nian Wu, Josh Tenenbaum, Song-Chun Zhu
Engineering 2020
Paper
We demonstrate the power of this perspective to develop cognitive AI systems with humanlike common sense by showing how to observe and apply FPICU with little training data to solve a wide range of challenging tasks, including tool use, planning, utility inference, and social learning.


TPAMI 2020 A Generalized Earley Parser for Human Activity Parsing and Prediction
Siyuan Qi, Baoxiong Jia, Siyuan Huang, Ping Wei, Song-Chun Zhu
TPAMI 2020
Paper
Propose an algorithm to tackle the task of understanding complex human activities from (partially observed) videos from two important aspects: activity recognition and prediction.


NeurIPS 2019 PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points
Siyuan Huang, Yixin Chen, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
Neural Information Processing Systems (NeurIPS) 2019
Paper
To solve the problem of 3D object detection, we propose perspective points as a novel intermediate representation. They are defined as the 2D projections of locally-Manhattan 3D keypoints that locate an object, and they satisfy certain geometric constraints imposed by the perspective projection.

ICCV 2019 Holistic++ Scene Understanding: Single-view 3D Holistic Scene Parsing and Human Pose Estimation with Human-Object Interaction and Physical Commonsense
Yixin Chen*, Siyuan Huang*, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper / Supplementary / Project
Propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction and (ii) 3D human pose estimation. We incorporate human-object interaction (HOI) and physical commonsense to tackle this problem.


ICCV 2019 Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning
Lifeng Fan*, Wenguan Wang*, Siyuan Huang, Xinyu Tang, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2019
* Equal contributions
Paper
Propose a new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human social interactions.


NeurIPS 2018 Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Neural Information Processing Systems (NeurIPS) 2018
Paper / Supplementary / Poster / Video / Code / Project
Propose an end-to-end model that simultaneously solves the tasks of 3D object detection, 3D layout estimation, and camera pose estimation in real time given only a single RGB image.


ECCV 2018 Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image
Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, Song-Chun Zhu
European Conference on Computer Vision (ECCV) 2018
Paper / Supplementary / Project / Code / Poster / Bibtex
Propose a computational framework to parse and reconstruct the 3D configuration of an indoor scene from a single RGB image in an analysis-by-synthesis fashion using a stochastic grammar model.



IJCV 2018 Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars
Chenfanfu Jiang*, Siyuan Qi*, Yixin Zhu*, Siyuan Huang*, Jenny Lin, Xingwen Guo, Lap-Fai Yu, Demetri Terzopoulos, Song-Chun Zhu
* Equal contributions
International Journal of Computer Vision (IJCV) 2018
Paper / Demo
Employ physics-based rendering to synthesize photorealistic RGB images while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity and material information, as well as illumination.


CVPR 2018 Human-centric Indoor Scene Synthesis using Stochastic Grammar
Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, Song-Chun Zhu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
Paper / Project / Code / Bibtex
Present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with perfect per-pixel ground truth.


ICCV 2017 Predicting Human Activities Using Stochastic Grammar
Siyuan Qi, Siyuan Huang, Ping Wei, Song-Chun Zhu
IEEE International Conference on Computer Vision (ICCV) 2017
Paper / Bibtex / Code
Use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances for modeling the rich context between human and environment.



arXiv 2015 Nonlinear Local Metric Learning for Person Re-identification
Siyuan Huang, Jiwen Lu, Jie Zhou, Anil K. Jain
arXiv 2015
arXiv Paper
Utilize the merits of both local metric learning and deep neural networks to exploit the complex nonlinear transformations in the feature space of person re-identification data.

ICIP 2015 Building Change Detection Based on 3D Reconstruction
Baohua Chen, Lei Deng, Yueqi Duan, Siyuan Huang, Jie Zhou
IEEE International Conference on Image Processing (ICIP) 2015
Paper / Bibtex
Propose a change detection framework based on an RGB-D map generated by 3D reconstruction, which can overcome large illumination changes.