3D hand reconstruction from a single RGB image is challenging due to articulated motion, self-occlusion, and interaction with objects. Existing SOTA methods employ attention-based transformers to learn the 3D hand pose and shape, but they fail to achieve robust and accurate performance due to insufficient modeling of joint spatial relations. To address this problem, we propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling. Our core idea is to reformulate Mamba's scanning into graph-guided bidirectional scanning for 3D reconstruction using only a few effective tokens. This enables us to learn the joint relations and spatial sequences that enhance reconstruction performance. Specifically, we design a novel Graph-guided State Space (GSS) block that learns the graph-structured relations and spatial sequences of joints while using 88.5% fewer tokens than attention-based methods. Additionally, we integrate the state space features with the global features using a fusion module. With the GSS block and the fusion module, Hamba effectively leverages graph-guided state space features and jointly considers global and local features to improve performance. Extensive experiments on several benchmarks and in-the-wild tests demonstrate that Hamba significantly outperforms existing SOTAs, achieving a PA-MPVPE of 5.3 mm and an F@15mm of 0.992 on FreiHAND. Hamba currently ranks first on both the HO3Dv2 and HO3Dv3 competition leaderboards for 3D hand reconstruction.
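To make the bidirectional-scanning idea concrete, below is a minimal, hypothetical sketch in PyTorch of scanning a short joint-token sequence in both directions and fusing the passes. The class name BiScan, the GRU used as a stand-in for the selective state space (Mamba) layer, and the linear fusion of the two directions are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class BiScan(nn.Module):
    # Scan a token sequence forward and backward and fuse the two passes.
    # A GRU stands in here for the selective state space (Mamba) layer.
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # forward scan
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # backward scan
        self.proj = nn.Linear(2 * dim, dim)             # fuse both directions

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim), e.g. one token per hand joint
        out_f, _ = self.fwd(tokens)
        out_b, _ = self.bwd(torch.flip(tokens, dims=[1]))
        out_b = torch.flip(out_b, dims=[1])
        return self.proj(torch.cat([out_f, out_b], dim=-1))

x = torch.randn(2, 21, 256)       # 21 joint tokens of width 256
print(BiScan(256)(x).shape)       # torch.Size([2, 21, 256])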
Given a hand image I, Hamba extracts tokens via a backbone model and downsampling layers. We design a graph-guided SSM decoder to regress the hand parameters. Hand joints are first regressed by the Joints Regressor (JR), whose output is fed to the Token Sampler (TS) to sample tokens \( T_{\text{TS}} \). The joint spatial-sequence tokens \( T_{\text{GSS}} \) are learned by the Graph-guided State Space (GSS) blocks. Note that the GCN takes \( T_{\text{TS}} \) as input, and its output is concatenated with the mean downsampled tokens. GSS leverages graph learning and SSM to capture joint spatial relations and achieve robust 3D reconstruction.
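One plausible reading of the JR-to-TS step is to predict a 2D location per joint and bilinearly sample one token from the backbone feature map at each location. The sketch below, assuming PyTorch, uses a hypothetical TokenSampler with a simple linear joint regressor and grid_sample; the actual JR and TS designs in Hamba may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSampler(nn.Module):
    # Sample one feature token per joint at its regressed 2D location.
    def __init__(self, dim, num_joints=21):
        super().__init__()
        # Joints Regressor stand-in: predicts normalized (x, y) in [-1, 1] per joint
        self.joint_regressor = nn.Linear(dim, num_joints * 2)
        self.num_joints = num_joints

    def forward(self, feat_map):
        # feat_map: (B, C, H, W) backbone features after downsampling
        B, C, H, W = feat_map.shape
        global_feat = feat_map.mean(dim=(2, 3))                       # (B, C)
        joints_2d = torch.tanh(self.joint_regressor(global_feat))     # (B, J*2) in [-1, 1]
        grid = joints_2d.view(B, self.num_joints, 1, 2)               # (B, J, 1, 2)
        tokens = F.grid_sample(feat_map, grid, align_corners=False)   # (B, C, J, 1)
        return tokens.squeeze(-1).transpose(1, 2)                     # (B, J, C) joint tokens

feat = torch.randn(2, 256, 16, 16)
print(TokenSampler(256)(feat).shape)    # torch.Size([2, 21, 256])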
We are the first to incorporate graph learning and state space modeling for robust 3D hand mesh reconstruction. Our key idea is to reformulate Mamba's scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens.
We propose a simple yet effective GSS block that captures structured relations between hand joints using graph convolution layers and Mamba blocks (see the sketch after this list).
We introduce a Token Sampler that effectively boosts performance, together with a fusion module that further improves results by integrating state space tokens with global features.
Extensive experiments on multiple challenging benchmarks demonstrate Hamba's superiority over current SOTAs, achieving impressive performance in in-the-wild scenarios.
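The following is a minimal sketch of what such a GSS block could look like, assuming PyTorch. The MANO-style 21-joint adjacency, the single-layer graph convolution, and the bidirectional GRU standing in for the Mamba SSM layer are all illustrative assumptions; they mirror the described GSS idea (a GCN over sampled joint tokens, concatenation with the mean downsampled token, and bidirectional scanning) rather than the paper's exact implementation.

import torch
import torch.nn as nn

# Hand kinematic tree (MANO-style 21 joints): wrist -> each finger chain (assumed layout).
EDGES = [(0, f) for f in (1, 5, 9, 13, 17)] + \
        [(i, i + 1) for f in (1, 5, 9, 13, 17) for i in (f, f + 1, f + 2)]

def normalized_adjacency(num_joints=21):
    A = torch.eye(num_joints)                 # self-loops
    for i, j in EDGES:
        A[i, j] = A[j, i] = 1.0
    return A / A.sum(dim=1, keepdim=True)     # row-normalized adjacency

class GSSBlockSketch(nn.Module):
    # Graph convolution over joint tokens, then a bidirectional scan over the sequence.
    # A bidirectional GRU stands in for the Mamba SSM layer.
    def __init__(self, dim, num_joints=21):
        super().__init__()
        self.register_buffer("A", normalized_adjacency(num_joints))
        self.gcn = nn.Linear(dim, dim)        # graph convolution weights
        self.scan = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joint_tokens, mean_token):
        # joint_tokens: (B, J, C) from the Token Sampler; mean_token: (B, 1, C)
        g = torch.relu(self.gcn(self.A @ joint_tokens))   # propagate features along the hand graph
        seq = torch.cat([g, mean_token], dim=1)           # append the mean downsampled token
        out, _ = self.scan(seq)                           # forward + backward scan
        return self.norm(out + seq)                       # residual over the scanned tokens

tokens = torch.randn(2, 21, 256)
mean_tok = torch.randn(2, 1, 256)
print(GSSBlockSketch(256)(tokens, mean_tok).shape)        # torch.Size([2, 22, 256])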
Qualitative in-the-wild comparison of the proposed Hamba with SOTAs on HInt-EpicKitchensVISOR. None of the models, including Hamba, were trained on HInt.
Qualitative in-the-wild results on the HInt-NewDays and HInt-EpicKitchensVISOR datasets, which include heavily occluded hands, hand-hand or hand-object interactions, and truncation scenarios. We did not use the HInt dataset to train or fine-tune Hamba.
@misc{dong2024hambasingleview3dhand,
  title={Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba},
  author={Haoye Dong and Aviral Chharia and Wenbo Gou and Francisco Vicente Carrasco and Fernando De la Torre},
  year={2024},
  eprint={2407.09646},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.09646},
}