Exploring the Impact of Rendering Method and Motion Quality on Model Performance when Using Multi-view Synthetic Data for Action Recognition

Abstract

This paper explores the use of synthetic data in a human action recognition (HAR) task to avoid the challenges of obtaining and labeling real-world datasets. We introduce a new dataset suite comprising five datasets, eleven common human activities, three synchronized camera views (aerial and ground) in three outdoor environments, and three visual domains (real and two synthetic). For the synthetic data, two rendering methods (standard computer graphics and neural rendering) and two sources of human motions (motion capture and video-based motion reconstruction) were employed. We evaluated each dataset type by training popular activity recognition models and comparing the performance on the real test data. Our results show that synthetic data achieve slightly lower accuracy (4-8%) than real data. On the other hand, a model pre-trained on synthetic data and fine-tuned on limited real data surpasses the performance of either domain alone. Standard computer graphics (CG)-rendered data delivers better performance than the data generated from the neural-based rendering method. The results suggest that the quality of the human motions in the training data also affects the test results: motion capture delivers higher test accuracy. Additionally, a model trained on CG aerial view synthetic data exhibits greater robustness against camera viewpoint changes than one trained on real data.

Activities

Human activity classes presented in our dataset suite.

Datasets

The download links lead to ZIP files.

Real Data

11 activities | 24 subjects | 1.5k sequences | 1.8M frames

Aerial View (6.52GB, MD5: cadb0d24d595cdca65ba4d16c6c14109)
Google Drive | Alternative Hosting

Ground View (28.54GB, MD5: 445e66b91ecf68390f4505bebee382b2)
Google Drive | Alternative Hosting

Synthetic Data

Four synthetic datasets combine two rendering methods with two human character motion sources.

Rendering methods: Computer Graphics (Blender) | Neural Rendering + Computer Graphics (Liquid Warping GAN + Blender)
Motion sources: Motion Capture Data (11 activities | 26 subjects | VICON) | Video-based Motions (5 activities, gestures only | 15 subjects | VIBE)

[SynCG-MC] Computer Graphics, Motion Capture Data

25.4k sequences | 31.4M frames

Aerial View (152.93GB, MD5: a7eec0f4576242d188e74ee10ebc877e)
Google Drive | Alternative Hosting

Ground View (418.78GB, MD5: ca2955cb44a3ac3c074e53406745f41c)
Google Drive | Alternative Hosting

[SynCG-RGB] Computer Graphics, Video-based Motions

6.1k sequences | 5.0M frames

Aerial View (24.43GB, MD5: f5657e247de7b74a83ab4df79e1b33e5)
Google Drive | Alternative Hosting

Ground View (67.15GB, MD5: 1cd58dc521c532eaf7a994bffdef62ff)
Google Drive | Alternative Hosting

[SynLWG-MC] Neural Rendering + Computer Graphics, Motion Capture Data

25.4k sequences | 31.2M frames

Aerial View (91.02GB, MD5: 3a8ae0c7a539a4e52ff3e112fe9f9af9)
Google Drive | Alternative Hosting

Ground View (213.67GB, MD5: 68a3cdb8019b08fff3cb7d2b9bc05576)
Google Drive | Alternative Hosting

[SynLWG-RGB] Neural Rendering + Computer Graphics, Video-based Motions

6.1k sequences | 5.0M frames

Aerial View (14.52GB, MD5: f04b912914478515bdbebec84b27b901)
Google Drive | Alternative Hosting

Ground View (34.86GB, MD5: c00dfd92321415b42704bb2e4b6d9cbe)
Google Drive | Alternative Hosting
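
Each archive above is listed with an MD5 checksum, which can be used to confirm that a large download completed without corruption. The following is a minimal sketch of such a check in Python, not an official tool of the dataset suite. The helper names `md5_of_file` and `verify_download` and the local filename `real_aerial.zip` are assumptions made for illustration; only the checksum value is taken from the Real Data listing.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the published value
    (comparison ignores case and surrounding whitespace)."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Hypothetical local filename; the actual name of the downloaded ZIP may differ.
    archive = Path("real_aerial.zip")
    # MD5 published for the Real Data aerial view archive.
    expected = "cadb0d24d595cdca65ba4d16c6c14109"
    if archive.exists():
        print("OK" if verify_download(archive, expected) else "MD5 mismatch")
    else:
        print(f"{archive} not found")
```

Reading in chunks matters here because the largest archive is over 400GB; a mismatch usually means the download was interrupted and should be repeated.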

License

The datasets are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

BibTeX

@InProceedings{Panev_2024_WACV,
	author    = {Panev, Stanislav and Kim, Emily and Namburu, Sai Abhishek Si and Nikolova, Desislava and de Melo, Celso and De la Torre, Fernando and Hodgins, Jessica},
	title     = {Exploring the Impact of Rendering Method and Motion Quality on Model Performance When Using Multi-View Synthetic Data for Action Recognition},
	booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
	month     = {January},
	year      = {2024},
	pages     = {4592-4602}
}

WACV 2024

REMAG is an HAR dataset suite comprising five datasets: one real and four synthetic, the latter obtained by combining two renderers (CG and neural) with two motion sources (motion capture and video-based). Each dataset includes three camera views.