Abstract:
Despite several recent advances in object-oriented generative temporal models, a few key challenges remain. First, while many of these advances are indispensable for a general world model, it is unclear how to combine the benefits of each method into a unified model. Second, although these models are trained with generative objectives, prior work has mainly investigated their object detection and tracking abilities, leaving the crucial ability of generation largely unexplored. Third, a few key abilities needed for more faithful generation, such as multi-modal uncertainty and situated behavior, are missing. In this paper, we introduce Generative Structured World Models (G-SWM). G-SWM not only unifies the key properties of previous models in a principled framework but also achieves two crucial new abilities: multi-modal uncertainty and situated behavior. By investigating generation ability in comparison to previous models, we demonstrate that G-SWM achieves the best or comparable performance in all experimental settings, including a few complex settings that have not been tested before.