Abstract:
Human-object interaction (HOI) detection is an important task in image understanding and reasoning. Each interaction is expressed as an HOI triplet <human, verb, object>, which requires bounding boxes for the human and the object as well as the action between them. In other words, training for this task requires strong supervision, which is however hard to procure. A natural way to overcome this is to pursue weakly-supervised learning, where we only know that certain HOI triplets are present in an image but their exact locations are unknown. Most weakly-supervised learning methods make no provision for leveraging strongly supervised data when it is available; indeed, a naive combination of these two paradigms in HOI detection fails to let them benefit each other. In this regard, we propose a mixed-supervised HOI detection pipeline: thanks to a specific design of momentum-independent learning, it learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It outperforms the state-of-the-art weakly- and fully-supervised methods under the same setting, and it performs close to or even better than many fully-supervised methods when using a mixture of full and weak supervision.