Intra- and Inter-Action Understanding via Temporal Action Parsing
Dian Shao
Yue Zhao
Bo Dai
Dahua Lin
The Chinese University of Hong Kong
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Some samples from the proposed TAPOS dataset. From top to bottom, the action classes are: triple jump, hammer throw, rings, and tumbling. Temporal boundaries of sub-actions are annotated; complex actions are therefore composed of temporally adjacent sub-actions, e.g., hammer throw in the second row comprises swing, rotate body, and throw.


Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion cues. While these methods have demonstrated remarkable performance on standard benchmarks, we still need a better understanding of how videos, in particular their internal structures, relate to high-level semantics. Such an understanding may bring benefits in multiple aspects, e.g. interpretable predictions and even new methods that take recognition performance to the next level. Towards this goal, we construct TAPOS, a new dataset built on sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it. Our study shows that a sports activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition. We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels. On the constructed TAPOS, the proposed method is shown to reveal intra-action information, i.e. how action instances are composed of sub-actions, and inter-action information, i.e. that one specific sub-action may commonly appear in various actions.

Data pattern

Similar sub-actions are shared by otherwise unrelated actions, e.g., jump in beam and triple jump (the first pair), somersault in uneven bars and diving (the second pair).

Model overview



How to read the temporal annotation files (JSON)?

Below, we show an example entry from the above JSON annotation file:
"yMK2zxDDs2A": { # youtube_id of this video
	"s00004_0_100_7_931": { # action_id: this action is in shot4 ranging from 0.1s to 7.931s
        "action": 11, # action-level label
        "substages": [ # sub-action boundaries
        "total_frames": 195, # total frame of thie action instance
        "shot_timestamps": [ # absolute temporal location of shot4 (s00004) within this video
        "subset": "train" # train or validation
    }, .....
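The entry above can be traversed with a few lines of Python. The sketch below assumes the action_id format inferred from the comments (shot index, then start and end times with the decimal point replaced by an underscore, e.g. "s00004_0_100_7_931" meaning shot 4, 0.100s to 7.931s); the in-memory dict stands in for the JSON file, which in practice you would load with `json.load(open(path))`.

```python
def parse_action_id(action_id):
    """Split an action_id such as 's00004_0_100_7_931' into
    (shot index, start in seconds, end in seconds).
    The format is inferred from the annotation comments above."""
    parts = action_id.split("_")
    shot = int(parts[0].lstrip("s"))
    start = float(parts[1] + "." + parts[2])
    end = float(parts[3] + "." + parts[4])
    return shot, start, end

# A minimal entry mirroring the annotation structure shown above
# (sub-stage boundaries and shot timestamps elided).
annotations = {
    "yMK2zxDDs2A": {
        "s00004_0_100_7_931": {
            "action": 11,
            "substages": [],
            "total_frames": 195,
            "shot_timestamps": [],
            "subset": "train",
        }
    }
}

for youtube_id, instances in annotations.items():
    for action_id, info in instances.items():
        shot, start, end = parse_action_id(action_id)
        print(youtube_id, info["action"], shot, start, end)
        # -> yMK2zxDDs2A 11 4 0.1 7.931
```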


Related projects

Shao, Zhao, Dai, Lin. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding. In CVPR, 2020 (oral) [arXiv][Project page]


Citation

@inproceedings{shao2020tapos,
    title={Intra- and Inter-Action Understanding via Temporal Action Parsing},
    author={Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2020}
}


Acknowledgements

This work is partially supported by SenseTime Collaborative Grant on Large-scale Multi-modality Analysis and the General Research Funds (GRF) of Hong Kong (No. 14203518 and No. 14205719).