Abstract
360° cameras capture the entire surrounding environment with a large field of view (FoV), providing comprehensive visual information from which 3D structure (e.g., depth and surface normals) and semantic information can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is challenging for two reasons: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the ultra-wide FoV of 360° images (360° × 180°); 2) the non-trivial problem of effectively merging geometry and semantics across tasks so that they benefit one another. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, capable of simultaneously inferring 3D structure, via, e.g., depth and surface normal estimation, and semantics, via semantic segmentation. Our key idea is to build a representation with strong global perception and less distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. At a negligible cost (∼1M parameters), a Bi-projection Bi-attention Fusion (B2F) module is designed to capture the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration (CoCo) module that first explicitly extracts task-specific geometric and semantic information from the learned representation to produce preliminary predictions, and then integrates the spatial contextual information among tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficiency of Elite360M, which outperforms prior multi-task learning methods (designed for planar images) with significantly fewer parameters on two benchmark datasets. Moreover, Elite360M performs on par with single-task learning methods. Code is available at https://VLIS2022.github.io/Elite360M.
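The two geometric ideas in the abstract, ERP distortion and attention between ERP pixels and icosahedron points, can be illustrated with a small NumPy sketch. This is not the authors' B2F module. The function names, the use of only the 12 base icosahedron vertices as the point set, the `temperature` value, and the fixed `gate` that mixes the two attention maps are all assumptions made for illustration. The released code at the URL above is the reference for the actual design.

```python
import numpy as np

def erp_directions(height, width):
    """Unit view directions (H, W, 3) and latitudes (H, W) of ERP pixel centres."""
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi    # [-pi, pi)
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi  # top row near +pi/2
    lon, lat = np.meshgrid(lon, lat)
    xyz = np.stack([np.cos(lat) * np.sin(lon),
                    np.sin(lat),
                    np.cos(lat) * np.cos(lon)], axis=-1)
    return xyz, lat

def erp_pixel_solid_angle(height, width):
    """Approximate solid angle per ERP pixel; it shrinks as cos(latitude)."""
    _, lat = erp_directions(height, width)
    return (2.0 * np.pi / width) * (np.pi / height) * np.cos(lat)

def icosahedron_vertices():
    """The 12 vertices of a regular icosahedron, projected onto the unit sphere."""
    phi = (1.0 + 5.0 ** 0.5) / 2.0
    verts = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    verts = np.asarray(verts)
    return verts / np.linalg.norm(verts, axis=1, keepdims=True)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(pix_feat, pt_feat, pix_xyz, pt_xyz, gate=0.5, temperature=0.5):
    """Hypothetical fusion of a semantic-aware and a distance-aware attention map.

    pix_feat: (N, C) ERP pixel features, pt_feat: (M, C) point features,
    pix_xyz: (N, 3) and pt_xyz: (M, 3) unit directions on the sphere.
    """
    # Semantic-aware: scaled dot-product similarity between features.
    sem = softmax(pix_feat @ pt_feat.T / np.sqrt(pix_feat.shape[1]))
    # Distance-aware: closer points on the sphere receive larger weights.
    dist = np.linalg.norm(pix_xyz[:, None, :] - pt_xyz[None, :, :], axis=-1)
    geo = softmax(-dist / temperature)
    # Assumption: a fixed scalar gate; a learned gate would be the natural choice.
    weights = gate * sem + (1.0 - gate) * geo
    return weights, weights @ pt_feat
```

In this sketch, `erp_pixel_solid_angle` makes the distortion concrete: a pixel in the top row of a 64 × 128 ERP image covers only a few percent of the solid angle of a pixel at the equator, while the icosahedron vertices are spread uniformly over the sphere. Setting `gate=0.0` reduces `bi_attention` to a purely distance-based weighting.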
| Original language | English |
|---|---|
| Publisher | arXiv |
| Publication status | Published - 18 Aug 2024 |
| Title | Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration |