Abstract:
In this paper, we address the problem of weakly-supervised Visual Relation Detection (VRD) and human-centric Scene Graph generation. Unlike prior works, we assume weaker, yet more natural, supervisory signals. Specifically, we assume only a pre-trained person detector, a generic region proposal mechanism, and a set of image-level object and relation labels per frame. Given this data, we formulate a very simple architecture with multi-task weak supervision at the object level (for individual proposed regions) and the relation level (for each person-object region pair). We show that, despite its simplicity, our approach achieves state-of-the-art results compared to other weakly- and strongly-supervised VRD models that are significantly more complex. In ablations, we also show that the proposed multi-task learning improves relation predictions. Our goal in this paper is to propose a strong, yet simple, baseline that will spur further developments in the VRD task.