Datasets for Action Classification and Detection
Weizmann Dataset
Excerpted from 2016-Review of Action Recognition and Detection Methods [this survey gives a fairly comprehensive introduction to datasets]
The Weizmann dataset was recorded by a static camera with a clean background.
CMU Crowded Videos Dataset
Excerpted from 2016-Review of Action Recognition and Detection Methods
The CMU Crowded Videos Dataset was recorded by a static camera with background motion.
Breakfast Dataset
The Breakfast video dataset consists of 10 cooking activities performed by 52 different actors in multiple kitchen locations. The cooking activities include the preparation of coffee, orange juice, chocolate milk, tea, a bowl of cereal, fried eggs, pancakes, a fruit salad, a sandwich, and scrambled eggs. All videos were manually labeled with 48 different action units, yielding 11,267 samples.
PKU-MMD
PKU-MMD is a large-scale dataset focusing on action detection in long continuous sequences and on multi-modality action analysis. The dataset was captured with the Kinect v2 sensor. There are 364x3 (views) long action sequences, each lasting about 3-4 minutes (recorded at 30 FPS) and containing approximately 20 action instances. In total, the dataset comprises 5,312,580 frames (about 3,000 minutes) with 21,545 temporally localized actions.
Visualization link: PKU-MMD skeleton data visualization program (animated, Python)
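As a quick sanity check, these statistics are roughly self-consistent. A minimal sketch in Python, using only the numbers quoted above (the 30 FPS value comes from the stated recording rate):

```python
# Rough consistency check of the PKU-MMD statistics quoted above.
total_frames = 5_312_580
fps = 30

total_minutes = total_frames / fps / 60
print(f"{total_minutes:.0f} minutes")  # ~2951, matching the quoted "3,000 minutes"

num_sequences = 364 * 3  # 364 sequences x 3 camera views
print(f"{total_minutes / num_sequences:.1f} min/sequence")  # ~2.7, close to the
                                                            # stated 3-4 minutes
```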
Something-Something Dataset
The 20BN-SOMETHING-SOMETHING dataset is a large collection of densely-labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world.
Kinetics 400
A dataset created by DeepMind.
The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.
UCF101
UCF101 is an action recognition dataset of realistic action videos collected from YouTube, covering 101 action categories with 13,320 video clips.

UCF101-24
UCF101-24 is a subset of UCF101 with spatio-temporal (bounding-box) annotations for 24 action classes. For a more detailed introduction, see: 时空行为检测数据集 JHMDB & UCF101_24 详解
UCF101-24 dataset download link: MOC_dataset - Google Drive
UCF-Sports
Official website: https://www.crcv.ucf.edu/data/UCF_Sports_Action.php
(Screenshot from 2016-Multi-region two-stream R-CNN for action detection - ECCV)
MultiSports
(The screenshot above is from 2022-Spatio-Temporal Action Detection Under Large Motion - arXiv)

HMDB-51
- General facial actions: smile, laugh, chew, talk.
- Facial actions with object manipulation: smoke, eat, drink.
- General body movements: cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
- Body movements with object interaction: brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
- Body movements for human interaction: fencing, hug, kick someone, kiss, punch, shake hands, sword fight.

JHMDB
JHMDB (joint-annotated HMDB) is a subset of HMDB-51 containing 928 clips from 21 action categories, annotated with human joint positions.
SBU Kinect Interaction dataset
The SBU Kinect Interaction dataset is a Kinect-captured human activity recognition dataset depicting two-person interactions. It contains 282 skeleton sequences and 6,822 frames across 8 classes. Each skeleton has 15 joints.
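Since each frame holds two 15-joint skeletons, a natural feature for two-person interaction recognition is the set of distances between the joints of the two persons. A minimal sketch, where the array layout and the random dummy frame are assumptions for illustration (not the dataset's actual file format):

```python
import numpy as np

# One hypothetical SBU frame: 2 persons x 15 joints x 3D coordinates.
frame = np.random.rand(2, 15, 3).astype(np.float32)

# Pairwise distances between person 0's joints and person 1's joints.
diff = frame[0][:, None, :] - frame[1][None, :, :]  # shape (15, 15, 3)
dist = np.linalg.norm(diff, axis=-1)                # shape (15, 15)
print(dist.shape, dist.min(), dist.max())
```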
NTU RGB+D
"NTU RGB+D" contains 60 action classes and 56,880 video samples.
"NTU RGB+D 120" extends "NTU RGB+D" by adding another 60 classes and another 57,600 video samples, i.e., "NTU RGB+D 120" has 120 classes and 114,480 samples in total.
These two datasets both contain RGB videos, depth map sequences, 3D skeletal data, and infrared (IR) videos for each sample. Each dataset is captured by three Kinect V2 cameras concurrently.
The RGB videos have a resolution of 1920x1080, the depth maps and IR videos are both 512x424, and the 3D skeletal data contains the 3D coordinates of 25 body joints in each frame.
Visualization program: NTU RGB+D dataset skeleton data visualization
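In array form, the skeletal stream of a single-body NTU sample is a sequence of 25 joints with 3 coordinates per frame. A minimal sketch (the 300-frame length and variable names are hypothetical, and this is not the official `.skeleton` parser):

```python
import numpy as np

num_frames = 300  # hypothetical sequence length
# Skeletal data for one body: (frames, 25 joints, xyz coordinates).
skeleton = np.zeros((num_frames, 25, 3), dtype=np.float32)

# 3D position of joint 0 at frame t:
t = 0
x, y, z = skeleton[t, 0]

# A common preprocessing step: center each frame on a reference joint,
# which removes camera translation from the coordinates.
centered = skeleton - skeleton[:, 0:1, :]
```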
CMU Mocap
The CMU Motion Capture Database contains marker-based motion capture recordings of subjects performing a wide range of activities (locomotion, sports, everyday actions, and more).

N-UCLA
The Northwestern-UCLA Multiview Action 3D dataset was captured simultaneously by three Kinect cameras from different viewpoints and contains 10 action categories performed by 10 subjects.

(Screenshot from 2017-PR-Enhanced skeleton visualization for view invariant human action recognition)
UWA3D
AVA
- 430 15-minute video clips
- 1.58M action labels, with multiple labels per person
Video source: the raw video content of the AVA dataset comes from YouTube; minutes 15 through 30 of each video are used. Each 15-min clip is then partitioned into 897 overlapping 3s movie segments with a stride of 1 second (see the sketch below).
Annotation method: short segments (±1.5 seconds centered on a keyframe) are used to provide temporal context for labeling the actions in the middle frame. Note that these segments overlap, with a stride of 1 second.
In the AVA dataset, the corresponding labels are provided for one frame per second.
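The sketch below shows how the segment and keyframe timing described above fits together, assuming a naive sliding window over exactly 900 seconds (this is not the official partitioning code):

```python
# Slide a 3 s window with a 1 s stride over a 15-minute (900 s) clip;
# each segment's middle frame is the keyframe that gets annotated.
clip_len = 15 * 60  # seconds
seg_len, stride = 3, 1

segments = [(s, s + seg_len) for s in range(0, clip_len - seg_len + 1, stride)]
keyframes = [s + seg_len / 2 for s, _ in segments]

print(len(segments))  # 898 with this naive windowing; the paper reports 897,
                      # so the exact boundary handling differs slightly
print(keyframes[:3])  # [1.5, 2.5, 3.5] -> one annotated keyframe per second
```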


AVA v2.2 differs from v2.1 in two ways. First, missing labels were added manually, increasing the number of annotations by 2.5%. Second, box positions were corrected for a small number of videos whose aspect ratio is much larger than 16:9.
AVA v2.1 differs from v2.0 only in the removal of a small number of movies that were identified as duplicates. The class list and label map remain unchanged from v1.0.