AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation

¹Meta Reality Labs  ²The University of Tokyo
CVPR 2023
Visualization of the automatically generated 3D hand pose annotations.
(Rows 1-4: exocentric views; Row 5: egocentric views)


[May 25th 2023] Data & code released.
[April 2nd 2023] Project page released.


We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline in which an initial set of manual annotations is used to train a model that automatically annotates a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. Using this data, we develop a strong single-view baseline for 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that higher-quality hand poses directly improve the ability to recognize actions.
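The abstract reports annotation quality as an average keypoint error in millimetres and as a relative reduction against earlier annotations. As a rough illustration of how such figures are commonly computed, the sketch below averages the Euclidean distance between predicted and reference 3D keypoints and expresses one error as a fraction of another. The function names and the toy coordinates are hypothetical and chosen only for this example; they are not taken from the AssemblyHands code or data, and the evaluation protocol in the paper may differ in its details (for instance in alignment or joint selection).

```python
import math


def mean_keypoint_error(pred, gt):
    """Average Euclidean distance between matching 3D keypoints.

    The result is in the same unit as the inputs (millimetres here).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if not pred:
        raise ValueError("at least one keypoint is required")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(new_error, old_error):
    """Fraction by which new_error is lower than old_error."""
    if old_error <= 0:
        raise ValueError("old_error must be positive")
    return 1.0 - new_error / old_error


# Toy keypoints in millimetres (hypothetical, not dataset values).
gt = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 5.0)]

error_mm = mean_keypoint_error(pred, gt)

# Toy comparison: an error of 5 mm against an earlier error of 20 mm.
reduction = relative_reduction(5.0, 20.0)

print(f"mean keypoint error: {error_mm:.2f} mm")
print(f"relative reduction: {reduction:.0%}")
```

Read this way, a statement such as "85% lower" means that the new error is about 15% of the error it is compared against.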

High-quality 3D hand poses as an effective representation for egocentric activity understanding. AssemblyHands provides high-quality 3D hand pose annotations computed from multi-view exocentric images sampled from Assembly101, which originally comes with inaccurate annotations computed from egocentric images (see the incorrect left-hand pose prediction). As we experimentally demonstrate on an action classification task, models trained on high-quality annotations achieve significantly higher accuracy.


Comparison of AssemblyHands with existing 3D hand pose datasets. “M” and “A” stand for manual and automatic annotation, respectively. AssemblyHands is the largest existing benchmark for egocentric 3D hand pose estimation.


Creative Commons License
AssemblyHands is licensed by us under a Creative Commons Attribution-NonCommercial 4.0 International License. The terms of this license are:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial: You may not use the material for commercial purposes.


    title     = {{AssemblyHands:} Towards Egocentric Activity Understanding via 3D Hand Pose Estimation},
    author    = {Takehiko Ohkawa and Kun He and Fadime Sener and Tomas Hodan and Luan Tran and Cem Keskin},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages     = {12999--13008},
    year      = {2023},