Exploring the Impact of Rendering Method and Motion Quality on Model Performance When Using Multi-View Synthetic Data for Action Recognition

Stanislav Panev*1      Emily Kim*1      Sai Abhishek Si Namburu1      Desislava Nikolova2
Celso de Melo3      Fernando de la Torre1      Jessica Hodgins1
1Carnegie Mellon University      2Technical University of Sofia      3Army Research Lab
*Equal Contribution

WACV 2024

REMAG is a human action recognition (HAR) dataset suite comprising five datasets: one real and four synthetic, created by combining two renderers (CG and neural) with two motion sources (motion capture and video-based). Each dataset includes three camera views.

Abstract

This paper explores the use of synthetic data in a human action recognition (HAR) task to avoid the challenges of obtaining and labeling real-world datasets. We introduce a new dataset suite comprising five datasets, eleven common human activities, three synchronized camera views (aerial and ground) in three outdoor environments, and three visual domains (one real and two synthetic). For the synthetic data, two rendering methods (standard computer graphics and neural rendering) and two sources of human motion (motion capture and video-based motion reconstruction) were employed. We evaluated each dataset type by training popular activity recognition models and comparing their performance on the real test data. Our results show that models trained on synthetic data achieve slightly lower accuracy (by 4-8%) than those trained on real data. On the other hand, a model pre-trained on synthetic data and fine-tuned on limited real data surpasses models trained on either domain alone. Data rendered with standard computer graphics (CG) delivers better performance than data generated with the neural rendering method. The results also suggest that the quality of the human motions in the training data affects the test results: motion capture delivers higher test accuracy than video-based motion reconstruction. Additionally, a model trained on CG aerial-view synthetic data exhibits greater robustness to camera viewpoint changes than one trained on real data.
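For reference, the evaluation protocol summarized above (pre-train on synthetic data, then fine-tune on limited real data) can be sketched with a standard video backbone. The following is a hypothetical PyTorch illustration, not the paper's actual training code: the model choice (torchvision's r3d_18), the hyperparameters, and the random placeholder clips are all assumptions.

# Hypothetical sketch of the synthetic-pretrain / real-fine-tune protocol.
# Not the authors' training code: model, hyperparameters, and the dummy
# data below are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.video import r3d_18

NUM_CLASSES = 11  # activity classes in the dataset suite

def make_loader(num_clips: int) -> DataLoader:
    # Placeholder loader: random clips of shape (C=3, T=16, H=112, W=112).
    # In practice these would be decoded from the downloaded videos.
    clips = torch.randn(num_clips, 3, 16, 112, 112)
    labels = torch.randint(0, NUM_CLASSES, (num_clips,))
    return DataLoader(TensorDataset(clips, labels), batch_size=4, shuffle=True)

def train(model: nn.Module, loader: DataLoader, epochs: int, lr: float) -> None:
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()

model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Stage 1: pre-train on (plentiful) synthetic clips, e.g. SynCG-MC.
train(model, make_loader(num_clips=64), epochs=1, lr=1e-2)

# Stage 2: fine-tune on a small amount of real data with a lower learning
# rate; the paper reports this beats training on either domain alone.
train(model, make_loader(num_clips=16), epochs=1, lr=1e-3)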

Activities

Human activity classes presented in our dataset suite.

Datasets

The download links lead to ZIP files hosted on Google Drive. Each archive is accompanied by its MD5 checksum; a verification sketch follows the dataset listings below.


Real Data

11 activities | 24 subjects | 1.5k sequences | 1.8M frames

  • Aerial View (6.52GB, MD5: cadb0d24d595cdca65ba4d16c6c14109)
  • Ground View (28.54GB, MD5: 445e66b91ecf68390f4505bebee382b2)

Synthetic Data

The synthetic datasets combine two sources of human character motions:

  • Motion Capture (VICON): 11 activities | 26 subjects
  • Video-based Motions (VIBE): 5 activities (gestures only) | 15 subjects

with two rendering methods:

  • Computer Graphics: Blender
  • Neural Rendering + Computer Graphics: Liquid Warping GAN + Blender

yielding four datasets:

SynCG-MC (Computer Graphics + Motion Capture)
25.4k sequences | 31.4M frames

  • Aerial View (152.93GB, MD5: a7eec0f4576242d188e74ee10ebc877e)
  • Ground Views (418.78GB, MD5: ca2955cb44a3ac3c074e53406745f41c)

SynCG-RGB (Computer Graphics + Video-based Motions)
6.1k sequences | 5.0M frames

  • Aerial View (24.43GB, MD5: f5657e247de7b74a83ab4df79e1b33e5)
  • Ground Views (67.15GB, MD5: 1cd58dc521c532eaf7a994bffdef62ff)

SynLWG-MC (Neural Rendering + Motion Capture)
25.4k sequences | 31.2M frames

  • Aerial View (91.02GB, MD5: 3a8ae0c7a539a4e52ff3e112fe9f9af9)
  • Ground Views (213.67GB, MD5: 68a3cdb8019b08fff3cb7d2b9bc05576)

SynLWG-RGB (Neural Rendering + Video-based Motions)
6.1k sequences | 5.0M frames

  • Aerial View (14.52GB, MD5: f04b912914478515bdbebec84b27b901)
  • Ground Views (34.86GB, MD5: c00dfd92321415b42704bb2e4b6d9cbe)
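After downloading, you can verify each archive against the MD5 checksum listed above before unzipping. Below is a minimal verification sketch in Python; the script name is a placeholder, and the archive path and expected hash are supplied by you on the command line.

# verify_md5.py -- compute the MD5 checksum of a downloaded archive in
# streaming fashion, so multi-hundred-GB files are never fully in memory.
import hashlib
import sys

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Usage: python verify_md5.py <archive.zip> <expected_md5>
    path, expected = sys.argv[1], sys.argv[2]
    actual = md5sum(path)
    print("OK" if actual == expected else f"MISMATCH: got {actual}")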

License

The datasets are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

BibTeX

@InProceedings{Panev_2024_WACV,
	author    = {Panev, Stanislav and Kim, Emily and Namburu, Sai Abhishek Si and Nikolova, Desislava and de Melo, Celso and De la Torre, Fernando and Hodgins, Jessica},
	title     = {Exploring the Impact of Rendering Method and Motion Quality on Model Performance When Using Multi-View Synthetic Data for Action Recognition},
	booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
	month     = {January},
	year      = {2024},
	pages     = {4592-4602}
}