Generating Triples with Adversarial Networks for Scene Graph Construction.
Driven by successes in deep learning, computer vision research has begun to move beyond object detection and image classification to more sophisticated tasks like image captioning or visual question answering. Motivating such endeavors is the desire for models to capture not only objects present in an image, but more fine-grained aspects of a scene such as relationships between objects and their attributes. Scene graphs provide a formal construct for capturing these aspects of an image. Despite this, there have been only a few recent efforts to generate scene graphs from imagery. Previous works limit themselves to settings where bounding box information is available at train time and do not attempt to generate scene graphs with attributes. In this paper we propose a method, based on recent advancements in Generative Adversarial Networks, to overcome these deficiencies. We take the approach of first generating small subgraphs, each describing a single statement about a scene from a specific region of the input image chosen using an attention mechanism. By doing so, our method is able to produce portions of the scene graph with attribute information without the need for bounding box labels. Then, the complete scene graph is constructed from these subgraphs. We show that our model improves upon prior work in scene graph generation on state-of-the-art data sets and accepted metrics. Further, we demonstrate that our model is capable of handling a larger vocabulary size than prior work has attempted.
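To make the construction concrete, the sketch below illustrates the kind of representation the abstract describes: each subgraph encodes a single statement as a subject–predicate–object triple plus per-object attributes, and the complete scene graph is the union of these subgraphs. The specific triples, attribute sets, and the `merge_subgraphs` helper are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch: a scene graph as a set of (subject, predicate, object)
# triples plus per-node attributes. Each subgraph corresponds to one
# statement about a region of the image, as described in the abstract.
subgraphs = [
    {"triples": [("man", "riding", "horse")], "attributes": {"horse": ["brown"]}},
    {"triples": [("man", "wearing", "hat")], "attributes": {"hat": ["straw"]}},
]

def merge_subgraphs(subgraphs):
    """Assemble a complete scene graph as the union of small subgraphs."""
    graph = {"triples": set(), "attributes": {}}
    for sg in subgraphs:
        graph["triples"].update(sg["triples"])  # deduplicate repeated statements
        for node, attrs in sg["attributes"].items():
            graph["attributes"].setdefault(node, set()).update(attrs)
    return graph

scene_graph = merge_subgraphs(subgraphs)
```

Under this view, deduplicating triples and unioning attribute sets is all the assembly step needs, since each subgraph is independently grounded in its own image region.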