3D Head Avatars from just 70K Random Internet Images! No 3D, no multi-view, no studio, no view synthesis at any stage of training or inference.
Although conditional generation is not the primary focus of this work, MVCHead is sufficiently structured to support personalized avatar synthesis while preserving multi-view consistency.
High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models.
Prior work on 3D Gaussian head avatars falls into three categories, distinguished by their data and supervision requirements:
We measure cross-view consistency using MEt3R across the three paradigms:
The key insight: self-renders from any fixed 3D Gaussian model are geometrically consistent by construction. This lets our SE(3) Multi-view Critic learn to distinguish plausible 3D configurations from inconsistent ones.
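The "consistent by construction" property can be checked on a toy example. The sketch below is not the MVCHead rasterizer: it assumes an ideal pinhole camera, a rotation about the y-axis only, and hypothetical helper names (`rot_y`, `project`, `warp_a_to_b`). It projects a few fixed 3D points into two views and confirms that warping view-A pixels with their depths lands on the corresponding view-B pixels.

```python
import math

def rot_y(deg):
    """Rotation matrix about the y-axis (row-major 3x3)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matvec(R, p):
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def to_camera(p_world, R, t):
    """World -> camera coordinates: x_cam = R @ x_world + t."""
    q = matvec(R, p_world)
    return [q[i] + t[i] for i in range(3)]

def project(p_cam, focal=500.0, center=256.0):
    """Pinhole projection; returns (u, v, depth)."""
    x, y, z = p_cam
    return (focal * x / z + center, focal * y / z + center, z)

def backproject(u, v, depth, focal=500.0, center=256.0):
    """Inverse of project(): pixel + depth -> camera-space point."""
    return [(u - center) * depth / focal, (v - center) * depth / focal, depth]

def warp_a_to_b(u, v, depth, pose_a, pose_b):
    """Lift a view-A pixel to 3D, move it to view B, and re-project."""
    (Ra, ta), (Rb, tb) = pose_a, pose_b
    p_cam_a = backproject(u, v, depth)
    # Invert the rigid transform: x_world = Ra^T @ (x_cam - ta)
    p_world = matvec(transpose(Ra), [p_cam_a[i] - ta[i] for i in range(3)])
    return project(to_camera(p_world, Rb, tb))[:2]

# A fixed "3D model" (three points) seen from two SE(3) poses.
points = [[0.1, -0.2, 0.0], [-0.3, 0.1, 0.2], [0.0, 0.0, -0.1]]
pose_a = (rot_y(0.0), [0.0, 0.0, 2.5])
pose_b = (rot_y(30.0), [0.0, 0.0, 2.5])

max_err = 0.0
for p in points:
    ua, va, da = project(to_camera(p, *pose_a))
    ub, vb, _ = project(to_camera(p, *pose_b))
    wu, wv = warp_a_to_b(ua, va, da, pose_a, pose_b)
    max_err = max(max_err, abs(wu - ub), abs(wv - vb))

print(max_err)  # reprojection error in pixels, zero up to floating-point rounding
```

Because both views come from the same underlying points, the warp error is zero up to rounding. A set of views that mixes two different 3D models would not satisfy this, which is the signal a multi-view critic can learn from.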
The first state space model for 3D Gaussian heads. MVCHead is a single-shot, real-time, end-to-end differentiable pipeline that directly predicts 240K Gaussians from a latent code via stacked HiSS blocks; HiBiSS aligns recurrence with the principal axes of multi-view drift, reconciling typical view-to-view inconsistencies.
The SE(3) Multi-view Critic is a learned reward that judges whether a set of self-renders arises from a single underlying 3D configuration. It rewards cross-view pixel alignment without ever observing real multi-view pairs, inducing MVC by design.
FaceGS-10K is the first large-scale dataset of ready-to-use 3D Gaussian head assets, independent of any parametric 3D head model. Each asset contains 240K anisotropic Gaussians and 24 renderings at 512×512, enabling training, benchmarking, and evaluation of 3D-aware head models.
MVCHead takes a latent code and produces a complete set of 3D Gaussians in a single forward pass. A stack of HiSS blocks progressively refines Gaussians from coarse to fine, while HiBiSS scans propagate geometric and appearance cues along the principal axes of multi-view drift. The resulting Gaussians are rasterized and evaluated jointly by an adversarial texture discriminator and our SE(3) Multi-view Critic, inducing multi-view consistency by design, without any 3D supervision.
L stacked HiSS blocks progressively refine Gaussians via anchor-based offsets, growing the Gaussian count per block until a 240K Gaussian budget is reached.
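The coarse-to-fine growth can be illustrated with a small counting sketch. The numbers here are assumptions chosen only so that the arithmetic reaches the 240K budget (3,750 initial anchors, 3 blocks, 4 children per anchor, since 3750 × 4³ = 240,000); the page does not state the actual schedule. Random offsets stand in for the offsets a HiSS block would predict, and `refine` is a hypothetical helper.

```python
import random

BUDGET = 240_000  # Gaussian budget stated on the page

def refine(anchors, children_per_anchor, scale, rng):
    """Spawn children as anchor + small offset.

    The random offsets are a stand-in for network-predicted offsets.
    """
    children = []
    for (x, y, z) in anchors:
        for _ in range(children_per_anchor):
            children.append((x + rng.uniform(-scale, scale),
                             y + rng.uniform(-scale, scale),
                             z + rng.uniform(-scale, scale)))
    return children

rng = random.Random(0)

# Assumed coarse level: 3,750 anchor positions in a unit cube.
gaussians = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
             for _ in range(3750)]

counts = [len(gaussians)]
scale = 0.1
for _ in range(3):                      # assumed number of blocks
    gaussians = refine(gaussians, 4, scale, rng)
    scale *= 0.5                        # finer levels make smaller corrections
    counts.append(len(gaussians))

print(counts)  # [3750, 15000, 60000, 240000]
```

Each level places new Gaussians relative to existing anchors, so fine detail is expressed as a small correction to a coarse layout rather than as absolute positions.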
Within each HiSS block, HiBiSS runs four directional scans (← → ↑ ↓) that align state-space recurrence with the axes where multi-view inconsistency is strongest, enabling anisotropic, pose-aware smoothing.
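A toy version of the four directional scans is sketched below. It is a deliberate simplification: a scalar linear recurrence with a fixed decay replaces Mamba's input-dependent selective scan, features are assumed to live on a 2D grid, and averaging the four outputs is an assumed fusion rule. The function names are hypothetical.

```python
def scan_1d(seq, decay):
    """Causal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h, out = 0.0, []
    for x in seq:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out

def scan_rows(grid, decay, reverse):
    """Scan every row left-to-right, or right-to-left when reverse=True."""
    out = []
    for row in grid:
        seq = row[::-1] if reverse else row
        res = scan_1d(seq, decay)
        out.append(res[::-1] if reverse else res)
    return out

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def four_way_scan(grid, decay=0.5):
    """Run the recurrence in four directions and average the results."""
    left_to_right = scan_rows(grid, decay, reverse=False)
    right_to_left = scan_rows(grid, decay, reverse=True)
    # Vertical scans are row scans on the transposed grid.
    top_to_bottom = transpose(scan_rows(transpose(grid), decay, reverse=False))
    bottom_to_top = transpose(scan_rows(transpose(grid), decay, reverse=True))
    h, w = len(grid), len(grid[0])
    return [[(left_to_right[i][j] + right_to_left[i][j]
              + top_to_bottom[i][j] + bottom_to_top[i][j]) / 4.0
             for j in range(w)] for i in range(h)]

# An impulse at the grid centre spreads symmetrically in all four directions.
grid = [[0.0] * 3 for _ in range(3)]
grid[1][1] = 1.0
out = four_way_scan(grid)
for row in out:
    print(row)
```

A single left-to-right scan would carry the impulse only to cells on its right. Combining the four directions lets every cell receive context from both sides along each axis.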
The SE(3) Multi-view Critic, a GTA-augmented ViT, evaluates whether self-renders arise from a single 3D configuration, providing a differentiable MVC reward trained as a binary set classifier against latent-mismatched negatives.
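The training signal for a binary set classifier with latent-mismatched negatives can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `render` is a hypothetical function that shifts a latent vector by a scalar pose, the critic is a hand-written consistency score rather than a ViT, negatives are assumed to swap one view for a render of a different latent, and the critic is assumed to receive the poses.

```python
import math
import random

def render(latent, pose):
    """Hypothetical rasterizer stand-in: a 'view' is the latent shifted by the pose."""
    return [z + pose for z in latent]

def make_set(latents, poses):
    """Render one view per (latent, pose) pair."""
    return [render(z, p) for z, p in zip(latents, poses)]

def critic_logit(views, poses):
    """Toy critic: undo the pose shift, then penalise disagreement across views."""
    aligned = [[v - p for v in view] for view, p in zip(views, poses)]
    spread = 0.0
    for d in range(len(aligned[0])):
        vals = [a[d] for a in aligned]
        mean = sum(vals) / len(vals)
        spread += sum((x - mean) ** 2 for x in vals) / len(vals)
    return 2.0 - 10.0 * spread  # high logit = "one underlying configuration"

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

rng = random.Random(0)
poses = [-0.5, 0.0, 0.5, 1.0]
z_a = [rng.gauss(0, 1) for _ in range(8)]
z_b = [rng.gauss(0, 1) for _ in range(8)]

# Positive set: four self-renders of one latent.
positive = make_set([z_a] * 4, poses)
# Negative set: the third view comes from a different latent.
negative = make_set([z_a, z_a, z_b, z_a], poses)

pos_logit = critic_logit(positive, poses)
neg_logit = critic_logit(negative, poses)
loss = bce_with_logits(pos_logit, 1.0) + bce_with_logits(neg_logit, 0.0)
print(pos_logit, neg_logit, loss)
```

No real multi-view pairs appear anywhere in this setup: both the positive and the negative sets are built from the generator's own renders, and only the latent bookkeeping tells them apart.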
We train MVCHead independently on FFHQ and FFHQ-C following the established experimental protocol, and benchmark against state-of-the-art generative 3D head models. We report perceptual realism (FID, FID3D) and three axes of multi-view consistency (shape, texture, and geometric), adapting metrics from MVGBench and MEt3R to provide the first comprehensive quantitative MVC assessment for 3D head avatars.
Comparison of FID scores at 512×512 resolution. MVCHead achieves state-of-the-art results on both benchmarks despite operating in the minimal-resource setting. †Uses a super-resolution network. *Results reported from the original paper.
| Method | Venue | FID (FFHQ) ↓ | FID (FFHQ-C) ↓ |
|---|---|---|---|
| StyleSDF | CVPR 2022 | 11.2† | — |
| EpiGRAF | NeurIPS 2022 | 9.92 | — |
| VoxGRAF | NeurIPS 2022 | 9.0 | — |
| GMPI | ECCV 2022 | 8.29 | — |
| StyleNeRF | ICLR 2022 | 7.80† | — |
| EG3D | CVPR 2022 | 4.70† | — |
| Mimic3D | ICCV 2023 | 5.37 | — |
| GSM | CVPR 2024 | 28.19 | — |
| GSGAN | NeurIPS 2024 | 5.60 | 5.17 |
| GGHead | SIGGRAPH Asia 2024 | 5.15* | 5.37 |
| CGSGAN | NeurIPS 2025 | 4.94 | 4.53 |
| MVCHead | Ours | 4.39 | 3.94 |
FID3D scores at 512×512, with camera poses randomly sampled across a wider range of viewpoints to probe realism at arbitrary angles.
| Method | Venue | FID3D (FFHQ) ↓ | FID3D (FFHQ-C) ↓ |
|---|---|---|---|
| GSGAN | NeurIPS 2024 | 10.50 | 7.68 |
| GGHead | SIGGRAPH Asia 2024 | 7.90 | 7.78 |
| CGSGAN | NeurIPS 2025 | 4.94 | 4.53 |
| MVCHead | Ours | 4.39 | 3.94 |
Consistency scores across shape, texture, and geometric axes, averaged over 100 avatars. MVCHead achieves state-of-the-art on five of six metrics; depth error is comparable between the two methods.
| Method | Shape: CD ↓ | Shape: depth ↓ | Texture: cPSNR ↑ | Texture: cSSIM ↑ | Texture: cLPIPS ↓ | Geometric: MEt3R ↓ |
|---|---|---|---|---|---|---|
| CGSGAN | 0.6724 | 6.6624 | 21.852 | 0.7434 | 0.0622 | 0.2814 |
| MVCHead | 0.6654 | 6.6649 | 22.082 | 0.7636 | 0.0528 | 0.2620 |
Ablations on FFHQ-C at 512×512. Each proposed component contributes meaningfully to both realism (FID) and multi-view consistency (MEt3R). The SE(3) Multi-view Critic and HiBiSS are especially critical for cross-view consistency.
| Configuration | FID ↓ | MEt3R ↓ |
|---|---|---|
| Full Model | 3.94 | 0.2620 |
| Loss | | |
| w/o ℒ_adv | training collapse | |
| w/o ℒ_mvc (SE(3) Critic) | 5.41 | 0.3144 |
| Model Architecture | | |
| w/o HiSS Block | 5.28 | 0.2948 |
| w/o HiBiSS Scan | 4.78 | 0.2873 |
The FaceGS-10K dataset is the first large-scale dataset of ready-to-use 3D Gaussian head assets, independent of any parametric 3D head model. Each asset contains 240K anisotropic Gaussians and 24 renderings at 512×512, enabling training, benchmarking, and evaluation of 3D-aware head models.