TY - GEN
T1 - Learning to exploit stability for 3D scene parsing
AU - Du, Yilun
AU - Liu, Zhijian
AU - Basevi, Hector
AU - Leonardis, Ales
AU - Freeman, William T.
AU - Tenenbaum, Joshua B.
AU - Wu, Jiajun
PY - 2018/12/2
Y1 - 2018/12/2
N2 - Human scene understanding uses a variety of visual and non-visual cues to perform inference on object types, poses, and relations. Physics is a rich and universal cue that we exploit to enhance scene understanding. In this paper, we integrate the physical cue of stability into the learning process by looping a physics engine into bottom-up recognition models, and apply it to the problem of 3D scene parsing. We first show that applying physics supervision to an existing scene understanding model increases performance, produces more stable predictions, and allows training to an equivalent performance level with fewer annotated training examples. We then present a novel architecture for 3D scene parsing named Prim R-CNN, which learns to predict bounding boxes as well as their 3D size, translation, and rotation. With physics supervision, Prim R-CNN outperforms existing scene understanding approaches on this problem. Finally, we show that fine-tuning with physics supervision on unlabeled real images improves real-domain transfer of models trained on synthetic data.
M3 - Conference contribution
T3 - Electronic Proceedings of the Neural Information Processing Systems Conference
BT - Advances in Neural Information Processing Systems 31 (NIPS 2018)
A2 - Bengio, S.
A2 - Wallach, H.
A2 - Larochelle, H.
A2 - Grauman, K.
A2 - Cesa-Bianchi, N.
A2 - Garnett, R.
PB - NIPS
T2 - 32nd Conference on Neural Information Processing Systems (NIPS 2018)
Y2 - 2 December 2018 through 8 December 2018
ER -