This webpage contains data, code, and results from the second GENEA Challenge, intended as a benchmark of data-driven automatic co-speech gesture generation. In the challenge, participating teams used a common speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was then rendered to video using a standardised visualisation and evaluated in several large, crowdsourced user studies. This year’s dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation, taken from the Talking With Hands 16.2M dataset. Ten teams participated in the evaluation across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech.
The evaluation results are a revolution, and a revelation: Some synthetic conditions are rated as significantly more human-like than human motion capture. At the same time, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings.
For more information, please see our paper and the challenge introduction video below. Links to the challenge data, code, and results are also provided below.
If you use materials from this challenge, please cite our paper about the challenge:
@article{kucherenko2024evaluating,
    author = {Kucherenko, Taras and Wolfert, Pieter and Yoon, Youngwoo and Viegas, Carla and Nikolov, Teodor and Tsakov, Mihail and Henter, Gustav Eje},
    title = {Evaluating Gesture Generation in a Large-scale Open Challenge: The {GENEA} Challenge 2022},
    journal = {ACM Transactions on Graphics},
    year = {2024},
    issue_date = {June 2024},
    publisher = {Association for Computing Machinery},
    volume = {43},
    number = {3},
    articleno = {32},
    issn = {0730-0301},
    doi = {10.1145/3656374},
    url = {https://doi.org/10.1145/3656374},
    month = jun,
}
Please also consider citing the original paper from Meta Research about the motion data:
@inproceedings{lee2019talking,
    author = {Lee, Gilwoo and Deng, Zhiwei and Ma, Shugao and Shiratori, Takaaki and Srinivasa, Siddhartha S. and Sheikh, Yaser},
    title = {{T}alking {W}ith {H}ands 16.2{M}: {A} large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
    series = {ICCV '19},
    publisher = {IEEE},
    pages = {763--772},
    doi = {10.1109/ICCV.2019.00085},
    year = {2019}
}