Abstract:
We consider the problem of Bird's Eye View (BEV) segmentation with a perspective monocular camera view as input. An effective solution to this problem is important for many autonomous navigation tasks such as behavior prediction and planning, since the BEV segmentation map provides an explainable intermediate representation that captures both the geometry and the layout of the surrounding scene. Our approach involves a novel view transformation layer that effectively exploits depth maps to transform 2D image features into the BEV space. The framework includes the design of a neural network architecture that produces BEV segmentation maps using the proposed transformation layer. Of particular interest is the evaluation of the proposed method in complex scenarios involving highly unstructured scenes that are not represented in static maps. In the absence of an appropriate dataset for this task, we introduce the EPOSH road-scene dataset, which consists of 560 video clips of highly unstructured construction scenes, annotated with unique labels in both the perspective and BEV views. For evaluation, we compare our approach against several competitive baselines and recently published works, and show improvements over the state of the art on the nuScenes dataset and on our EPOSH dataset. We plan to release the dataset, code, and trained models used in the paper at https://usa.honda-ri.com/eposh