Abstract:
Navigation towards different objects is prevalent in daily life. State-of-the-art embodied vision methods accomplish the task by implicitly learning the relationship between perception and action or by optimizing them with separate objectives. While effective in some cases, they have yet to develop (1) a tight integration of perception and action, and (2) the capability to address the visual variance that is significant in moving and embodied settings. To close these research gaps, we introduce a new attention mechanism, which represents the pursuit of visual information that highlights the potential directions of final targets and guides agents' actions for visual navigation. Instead of serving conventionally as a weighted map for aggregating visual features, the new attention is defined as a compact intermediate state connecting visual observations and action. It is explicitly coupled with action to enable joint optimization through a consistent action space, and also plays an important role in alleviating the effects of visual variance. Our experiments show significant improvements in navigation across various types of unseen environments with known and unknown semantics. Ablation analyses indicate that the proposed method integrates perception and action by correlating attention patterns with the directions of action, and overcomes visual variance by distilling useful information from visual observations into the attention distribution. Our code is publicly available at https://github.com/szzexpoi/ana.
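
To make the core idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of attention defined as a compact intermediate state that shares the action space with the policy, so that attention and action can be optimized jointly. All names, dimensions, and the number of actions are assumptions chosen for illustration only.

```python
import torch
import torch.nn as nn

class ActionAwareAttention(nn.Module):
    """Illustrative sketch: attention as a compact state over candidate
    action directions, jointly optimized with action prediction.
    Hyperparameters and layer choices are hypothetical, not from the paper."""

    def __init__(self, feat_dim=512, num_actions=6, hidden_dim=256):
        super().__init__()
        # Map visual features to a distribution defined over the action space,
        # rather than a spatial weighting map over the image.
        self.to_attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )
        # The policy consumes the compact attention state instead of raw features,
        # coupling perception and action through a consistent action space.
        self.policy = nn.Linear(num_actions, num_actions)

    def forward(self, visual_feat):
        # Attention over potential directions of the target (same space as actions).
        attention = torch.softmax(self.to_attention(visual_feat), dim=-1)
        # Action logits predicted from the attention state; both heads can be
        # trained with a joint objective since they share the action space.
        action_logits = self.policy(attention)
        return attention, action_logits
```

Because the attention distribution and the action prediction live in the same space, a single objective (e.g., a navigation loss on the action logits, optionally with supervision on the attention itself) can update both jointly, which is the coupling the abstract describes; the exact losses used by the authors are not reproduced here.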