FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding

Dian Shao

Yue Zhao

Bo Dai

Dahua Lin

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Oral Presentation

An overview of the FineGym dataset. We provide coarse-to-fine annotations both temporally and semantically. There are three levels of categorical labels. The temporal dimension (represented by the two bars) is also divided into two levels, i.e., actions and sub-actions. Sub-actions can be described generally using set categories or precisely using element categories. Ground-truth element categories of sub-action instances are obtained via manually constructed decision trees.

Abstract

On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g. sports analysis, which requires the capability of parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g. how to parse the temporal structures from a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset could advance research towards action understanding.

Demo video

An illustrative video of FineGym's hierarchical annotations for a complete competition. Action and sub-action boundaries are highlighted, while irrelevant fragments are fast-forwarded. We also present the tree-based annotation process at the end of the demo video.

5-Minute Oral presentation video

1-Minute presentation video

Dataset hierarchy

FineGym organizes both the semantic and temporal annotations hierarchically. The upper part shows three levels of categorical labels, namely events (e.g. balance beam), sets (e.g. dismounts) and elements (e.g. salto forward tucked). The lower part depicts the two-level temporal annotations, i.e. the temporal boundaries of actions (in the top bar) and sub-action instances (in the bottom bar).

Sub-action examples

We present several examples of fine-grained sub-action instances. Each group shows three element categories within the same event (BB, FX, UB, or VT). These fine-grained instances differ in subtle and challenging ways.

Balance Beam (BB)	Floor Exercise (FX)
Uneven Bar (UB)	Vault (VT)

Empirical Studies and Analysis

(1) Element-level action recognition raises great challenges for existing methods.

Element-level action recognition results of representative methods.

(2) Sparse sampling is insufficient for fine-grained action recognition.

Performances of TSN when varying the number of sampled frames during training.

(3) How important is temporal information?

(a) Motion features (e.g. optical flow) can capture frame-wise temporal dynamics, leading to better performance of TSN.
(b) Temporal dynamics play an important role in FineGym, and TRN can capture them.
(c) The performance of TSM drops sharply when the number of testing frames differs greatly from that used in training, whereas TSN maintains its performance because it applies only temporal average pooling.

(a) Per-class performances of TSN with motion and appearance features in 6 element categories.
(b) Performances of TRN on the set UB-circles using ordered or shuffled testing frames.
(c) Mean-class accuracies of TSM and TSN on Gym99 when trained with 3 frames and tested with more frames.
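
The observation in (c) about temporal average pooling can be illustrated with a toy sketch. This is not the authors' code: the function name and the per-frame scores below are made up for illustration. It only shows that an averaging consensus is defined for any number of frames and ignores frame order, so changing the number of testing frames does not change how the scores are combined.

```python
def average_consensus(frame_scores):
    """Average per-frame class scores over time (a TSN-style consensus)."""
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    return [
        sum(frame[c] for frame in frame_scores) / num_frames
        for c in range(num_classes)
    ]


# Hypothetical per-frame scores over two classes (made-up numbers).
three_frames = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
six_frames = three_frames * 2  # same content, twice as many frames

# Both calls give (up to floating-point error) the same video-level scores.
print(average_consensus(three_frames))
print(average_consensus(six_frames))
```

Because averaging discards frame order, a model that relies on it cannot benefit from ordered frames, which is consistent with the contrast drawn in (b).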

(4) Does pre-training on large-scale video datasets help?

On FineGym, pre-training on Kinetics is not always helpful. One potential reason is the large gaps in terms of temporal patterns between coarse- and fine-grained actions.

Per-class performances of I3D pre-trained on Kinetics and ImageNet in various element categories.

(5) Why does pose information not help?

Skeleton-based ST-GCN struggles due to the challenges in skeleton estimation on gymnastics instances.

The results of person detection and pose estimation using AlphaPose for a Vault routine. Detections and pose estimates of the gymnast are missing in multiple frames, especially in frames with intense motion. These frames are important for fine-grained recognition.

Download

Updates

[23/07/2020] We have made pre-extracted features available on GitHub.
[16/04/2020] We fixed a small issue in the naming of the subaction identifier "A_{ZZZZ}_{WWWW}" to avoid ambiguity. (Thanks to Haodong Duan for pointing this out.)
[16/04/2020] We added new subsections to track updates and address FAQs.

FAQs

Q0: License issue:
A0: The annotations of FineGym are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License.
Q1: Some links are invalid on YouTube. How can I obtain the missing videos?
Q1': I am located in mainland China and I cannot access YouTube. How can I get the dataset?
A1: Please submit the Google form at this link. We will try to get back to you shortly.
Q2: Is the event-/element-level instance in your dataset cut in integral seconds?
A2: No. Instances at all levels (actions and sub-actions) are annotated with exact timestamps (milliseconds) in pursuit of frame-level precision. The numbers in the identifiers are derived from integral seconds for conciseness. Please refer to the instructions below for details.
Q3: What is the difference between Mean and Top-1 accuracy in Tables 2 and 3?
A3: The Top-K accuracy is the fraction of the instances whose correct label falls in the top-k most confident predictions. In our case we take K=1.
The mean accuracy is the averaged per-class accuracy. To be specific, we calculate the top-1 accuracy of each class i to be A_i. The mean accuracy is the arithmetic mean of A_{1...N}, i.e. (A_1 + A_2 + ... + A_N)/N, where N is the number of classes.
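
As an illustration of the two metrics in A3, here is a minimal sketch in Python. The function names and the toy labels and predictions are our own assumptions for this example and are not taken from the FineGym code base. It shows how the two numbers diverge when classes are imbalanced.

```python
from collections import defaultdict


def top1_accuracy(labels, preds):
    """Fraction of instances whose top-1 prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def mean_class_accuracy(labels, preds):
    """Arithmetic mean of the per-class top-1 accuracies A_1 ... A_N."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for y, p in zip(labels, preds):
        totals[y] += 1
        hits[y] += int(y == p)
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Toy, made-up data: class 0 has 8 instances, class 1 has only 2.
labels = [0] * 8 + [1] * 2
preds = [0] * 8 + [0, 1]

print(top1_accuracy(labels, preds))        # 0.9  (9 of 10 correct)
print(mean_class_accuracy(labels, preds))  # 0.75 (mean of 1.0 and 0.5)
```

In this toy case the rare class is half wrong, which barely moves the Top-1 accuracy but lowers the mean accuracy noticeably.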

How to read the temporal annotation files (JSON)?

Below, we show an example entry from the above JSON annotation file:

"0LtLS9wROrk": {
	"E_002407_002435": {
		"event": 4,
		"segments": {
			"A_0003_0005": {
				"stages": 1,
				"timestamps": [
					[
						3.45,
						5.64
					]
				]
			},
			"A_0006_0008": { ... },
			"A_0023_0028": { ... },
			...
		},
		"timestamps": [
			[
				2407.32,
				2435.28
			]
		]
	},
	"E_002681_002688": {
		"event": 1,
		"segments": {
			"A_0000_0006": {
				"stages": 3,
				"timestamps": [
					[
						0.04,
						3.2
					],
					[
						3.2,
						4.49
					],
					[
						4.49,
						6.57
					]
				]
			}
		},
		"timestamps": [
			[
				2681.88,
				2688.48
			]
		]
	},
	"E_002710_002737": { ... },
	...
}

The example shows the annotations for one video. Each video is keyed by a unique identifier, here "0LtLS9wROrk", which is its 11-character YouTube identifier.
Each video entry contains all action (event-level) instances, whose names follow the format "E_{XXXXXX}_{YYYYYY}". Here, "E" indicates "Event", and "XXXXXX"/"YYYYYY" indicate the zero-padded starting and ending timestamps (in seconds, truncated to integers).
Each action instance includes (1) the exact timestamps in the original video ('timestamps', in seconds), (2) the event label ('event'), and (3) the annotated subaction (element-level) instances ('segments').
The annotated subaction instances follow the format "A_{ZZZZ}_{WWWW}". Here, "A" indicates "subAction", and "ZZZZ"/"WWWW" indicate the zero-padded starting and ending timestamps (in seconds, truncated to integers).
Each subaction instance includes (1) the number of stages of this subaction instance ('stages', 3 for Vault and 1 for other events) and (2) the exact timestamps of each stage relative to the starting time of the event ('timestamps', in seconds). As a result, each subaction instance has a unique identifier "{VIDEO_ID}_E_{XXXXXX}_{YYYYYY}_A_{ZZZZ}_{WWWW}". This identifier serves as the instance name in the train/val splits of Gym99 and Gym288.
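
The format described above can be read with a few lines of Python. The following is a minimal sketch, not official tooling: the helper name iter_subactions is hypothetical, the embedded JSON is a trimmed copy of the example entry (the elided "..." entries are left out), and tolerating a missing or null 'segments' field is our own assumption. It builds the unique identifier of each subaction instance and converts the stage timestamps from event-relative to absolute video time.

```python
import json

# Trimmed copy of the example entry shown above (elided entries omitted).
ANNOTATION_JSON = """
{
  "0LtLS9wROrk": {
    "E_002407_002435": {
      "event": 4,
      "segments": {
        "A_0003_0005": {"stages": 1, "timestamps": [[3.45, 5.64]]}
      },
      "timestamps": [[2407.32, 2435.28]]
    },
    "E_002681_002688": {
      "event": 1,
      "segments": {
        "A_0000_0006": {
          "stages": 3,
          "timestamps": [[0.04, 3.2], [3.2, 4.49], [4.49, 6.57]]
        }
      },
      "timestamps": [[2681.88, 2688.48]]
    }
  }
}
"""


def iter_subactions(annotations):
    """Yield (identifier, event_label, absolute_stage_times) per subaction."""
    for video_id, events in annotations.items():
        for event_id, event in events.items():
            event_start = event["timestamps"][0][0]
            # Assumption: an event may have no annotated segments.
            for sub_id, sub in (event.get("segments") or {}).items():
                # Stage timestamps are relative to the start of the event.
                absolute = [
                    (round(event_start + start, 2), round(event_start + end, 2))
                    for start, end in sub["timestamps"]
                ]
                yield f"{video_id}_{event_id}_{sub_id}", event["event"], absolute


instances = list(iter_subactions(json.loads(ANNOTATION_JSON)))
for name, event_label, stages in instances:
    print(name, event_label, stages)
```

The first identifier produced is "0LtLS9wROrk_E_002407_002435_A_0003_0005", which matches the "{VIDEO_ID}_E_{XXXXXX}_{YYYYYY}_A_{ZZZZ}_{WWWW}" pattern.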

How to read the question annotation files (JSON)?

Below, we show an example entry from the above JSON annotation file:

"0": {
	"BTcode": "1111111",
	"questions": [
		"round-off onto the springboard?",
		"turning entry after round-off (turning in first flight phase)?",
		"Facing the coming direction when handstand on vault (0.5 turn in first flight phase)?",
		"Body keep stretched  during salto (stretched salto)?",
		"Salto with turn?",
		"Facing vault table after landing?",
		"Salto with 1.5 turn?"
	],
	"code": "6.00"
},
"1": {
	"BTcode": "1111110",
	"questions": [
		"round-off onto the springboard?",
		"turning entry after round-off (turning in first flight phase)?",
		"Facing the coming direction when handstand on vault (0.5 turn in first flight phase)?",
		"Body keep stretched  during salto (stretched salto)?",
		"Salto with turn?",
		"Facing vault table after landing?",
		"Salto with 1.5 turn?"
	],
	"code": "5.20"
},
...

The example shows the questions related to each class. The identifier corresponds to the label name provided in the Gym530 category list. Each class includes (1) a list of questions that are asked ('questions'), (2) a string of binary codes ('BTcode'), where 1 refers to 'yes' and 0 refers to 'no', and (3) the original code in the official codebook ('code').
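
To make the pairing of questions and binary codes concrete, here is a minimal Python sketch. The helper name decode_answers is hypothetical, and the sketch assumes that 'BTcode' holds exactly one character per question, in the same order, as in the example entries. The embedded JSON copies class "1" from the example above.

```python
import json

# Class "1" copied from the example above.
QUESTION_JSON = """
{
  "1": {
    "BTcode": "1111110",
    "questions": [
      "round-off onto the springboard?",
      "turning entry after round-off (turning in first flight phase)?",
      "Facing the coming direction when handstand on vault (0.5 turn in first flight phase)?",
      "Body keep stretched  during salto (stretched salto)?",
      "Salto with turn?",
      "Facing vault table after landing?",
      "Salto with 1.5 turn?"
    ],
    "code": "5.20"
  }
}
"""


def decode_answers(entry):
    """Pair each question with its yes/no answer taken from 'BTcode'."""
    questions, bits = entry["questions"], entry["BTcode"]
    # Assumption: one binary digit per question, in the same order.
    if len(questions) != len(bits):
        raise ValueError("BTcode length does not match the number of questions")
    return [(question, bit == "1") for question, bit in zip(questions, bits)]


entry = json.loads(QUESTION_JSON)["1"]
answers = decode_answers(entry)
for question, answer in answers:
    print("yes" if answer else "no ", question)
print("codebook code:", entry["code"])
```

For this class, every question is answered 'yes' except the last one ("Salto with 1.5 turn?"), which is the only bit that differs from class "0".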

Paper

Shao, Zhao, Dai, Lin.
FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
In CVPR, 2020 (oral).
(arXiv)

(Additional details/supplementary materials)

Cite

@inproceedings{shao2020finegym,
title={FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding},
author={Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2020}
}

Acknowledgements

We sincerely thank the outstanding annotation team for their excellent work. This work is partially supported by SenseTime Collaborative Grant on Large-scale Multi-modality Analysis and the General Research Funds (GRF) of Hong Kong (No. 14203518 and No. 14205719). The template of this webpage is borrowed from Richard Zhang.

Contact

For further questions and suggestions, please contact Dian Shao (sd017@ie.cuhk.edu.hk) or Zhenzhi Wang (wz122@ie.cuhk.edu.hk).