Object Level Visual Reasoning in Videos

Fabien Baradel · Natalia Neverova · Christian Wolf · Julien Mille · Greg Mori

Facebook Research · INSA Centre Val de Loire · Simon Fraser University

ECCV 2018


Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, and features related to the global context. The next open challenges in activity recognition require a level of understanding that pushes beyond this and calls for models capable of fine distinction and detailed comprehension of interactions between actors and objects in a scene. We propose a model capable of learning to reason about semantically meaningful spatio-temporal interactions in videos. The key to our approach is the choice of performing this reasoning at the object level through the integration of state-of-the-art object detection networks. This allows the model to learn detailed spatial interactions that exist at a semantic, object-interaction relevant level. We evaluate our method on three standard datasets (Twenty-BN Something-Something, VLOG and EPIC Kitchens) and achieve state-of-the-art results on all of them. Finally, we show visualizations of the interactions learned by the model, which illustrate object classes and their interactions corresponding to different activity classes.

Camera Ready paper | arXiv version | Complementary Mask Data | Poster

What happened between these two frames?
Humans have an extraordinary ability to perform visual reasoning about very complicated tasks, which remains unattainable for contemporary computer vision algorithms (below, a carrot was chopped by a human).

Object Relation Network

We propose an Object Relation Network (ORN), a neural network module for reasoning about detected semantic object instances through space and time. The ORN conducts relational reasoning over object interactions for the purpose of activity recognition. Its input is a set of object detection masks ranging over different object categories and temporal occurrences. The ORN infers pairwise relationships between objects detected at different moments in time: it selects objects from different frames that have a semantic definition (e.g. a carrot) and is capable of long-range reasoning.
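The pairwise reasoning described above can be sketched in a few lines: every ordered pair of object feature vectors is passed through a shared MLP and the results are aggregated into one relational representation. This is a minimal NumPy sketch of that idea only; the function names, layer sizes, and two-layer MLP are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pairwise_relations(objects, w1, w2):
    """Apply a shared two-layer MLP g to every ordered pair of object
    feature vectors and sum the outputs into one relation vector.

    objects: (N, D) array -- one row per detection, pooled over frames.
    w1, w2:  illustrative weight matrices of the shared MLP.
    """
    n, _ = objects.shape
    a = np.repeat(objects, n, axis=0)       # left member of each pair
    b = np.tile(objects, (n, 1))            # right member of each pair
    pairs = np.concatenate([a, b], axis=1)  # (N*N, 2D) all ordered pairs
    h = relu(pairs @ w1)                    # shared MLP, applied per pair
    return relu(h @ w2).sum(axis=0)         # aggregate over all pairs

rng = np.random.default_rng(0)
objs = rng.normal(size=(5, 16))             # 5 detections, 16-dim features
w1 = rng.normal(size=(32, 64)) * 0.1
w2 = rng.normal(size=(64, 24)) * 0.1
rel = pairwise_relations(objs, w1, w2)
print(rel.shape)  # (24,)
```

Because the MLP is shared across pairs and the aggregation is a sum, the output is invariant to the ordering of the detections, which is what lets the module relate objects drawn from different frames.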

Explainer Video

Below is a two-minute video briefly explaining our model and showing selected examples.

Code and Masks

We open-source our PyTorch implementation on GitHub and release the Mask R-CNN predictions used for the object head (resolution 100x100, minimum confidence 0.5). You should first download the Mask R-CNN predictions for VLOG and EPIC-AR. For more information, please refer to the Complementary Mask Data page.
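If you load the released predictions yourself, applying the same minimum-confidence threshold of 0.5 mentioned above amounts to a simple filter. The snippet below is a hypothetical sketch: the dict keys (`"score"`, `"class"`) are assumptions for illustration, and the exact schema of the released prediction files may differ.

```python
def filter_detections(detections, min_score=0.5):
    """Keep only detections whose confidence meets the threshold.

    `detections` is assumed to be a list of dicts with at least a
    'score' key (plus e.g. 'class' and a mask); adapt the keys to the
    actual format of the released Mask R-CNN prediction files.
    """
    return [d for d in detections if d["score"] >= min_score]

# Toy example with made-up detections:
preds = [
    {"class": "carrot", "score": 0.91},
    {"class": "knife", "score": 0.42},   # below threshold, dropped
    {"class": "hand", "score": 0.66},
]
kept = filter_detections(preds)
print([d["class"] for d in kept])  # ['carrot', 'hand']
```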

Citation

    @InProceedings{Baradel_2018_ECCV,
        author = {Baradel, Fabien and Neverova, Natalia and Wolf, Christian and Mille, Julien and Mori, Greg},
        title = {Object Level Visual Reasoning in Videos},
        booktitle = {ECCV},
        month = {June},
        year = {2018}
    }

This work was supported by the ANR/NSERC DeepVision project.