Pose-conditioned Spatio-Temporal Attention
for Human Action Recognition

Fabien Baradel
INSA Lyon
Christian Wolf
INSA Lyon
Julien Mille
INSA Centre Val de Loire

arXiv:1703.10106

Abstract

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the four hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is beneficial to shift the attention to different hands at different time steps, depending on the activity itself. Finally, a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on three datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to state-of-the-art is achieved on the smaller MSR Daily Activity 3D dataset.
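
As a rough illustration of the two attention stages mentioned in the abstract, the Python sketch below pools the hand features at each time step with soft-attention weights, then pools the resulting per-step features over time. It is a toy sketch under stated assumptions, not our implementation. The function names, feature values and scores are all hypothetical. In the model the scores come from learned networks conditioned on the pose stream, and an LSTM sits between the two stages; both are left out here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention_pool(features, scores):
    """Return the attention-weighted sum of feature vectors and the weights."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


def pool_sequence(hand_features, spatial_scores, temporal_scores):
    """Pool the hands within each time step, then pool the time steps.

    hand_features[t][h] is a hypothetical glimpse feature for hand h at step t.
    The scores are plain inputs here (an assumption); they stand in for the
    outputs of learned, pose-conditioned networks. The LSTM is omitted.
    """
    per_step = [
        soft_attention_pool(feats, scores)[0]
        for feats, scores in zip(hand_features, spatial_scores)
    ]
    return soft_attention_pool(per_step, temporal_scores)


# Toy input: two time steps, four hands, 2-D features.
hand_features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
# Step 0 attends to all hands equally; step 1 focuses on the first hand.
spatial_scores = [[0.0, 0.0, 0.0, 0.0], [50.0, 0.0, 0.0, 0.0]]
temporal_scores = [0.0, 0.0]  # both time steps weighted equally

video_feature, temporal_weights = pool_sequence(
    hand_features, spatial_scores, temporal_scores
)
print(video_feature, temporal_weights)
```

Because both stages are weighted sums with softmax weights, the whole pipeline stays differentiable, which is what allows attention of this kind to be trained end to end.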

Download Paper

Explainer Video

Below is a 4-minute video briefly explaining our model and showing selected examples.

Visualisation of the Attention Process

We show some selected actions and their corresponding representations according to our proposed model.

NTU RGB+D Dataset

RGB raw data

Attention process

"Giving something to other person"

MSR Daily Activity 3D Dataset

RGB raw data

Attention process

"Cellphone calling"

SBU Kinect Interaction Dataset

RGB raw data

Attention process

"Giving something to other person"

Bibtex

@article{baradel2017a,
  author    = {Fabien Baradel and Christian Wolf and Julien Mille},
  title     = {Pose-conditioned Spatio-Temporal Attention for Human Action Recognition},
  journal   = {arxiv},
  volume    = {1703.10106},
  year      = {2017},
}

Acknowledgements

This work was supported by the ANR/NSERC DeepVision project.
