Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

Chharia, Aviral; De la Torre, Fernando

Abstract

Learning 3D Heads Without 3D Data

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models.

Motivation

How does MVCHead differ from broader 3D head generation paradigms?

Prior work on 3D Gaussian head avatars falls into three categories, distinguished by their data and supervision requirements:

Optimization-based methods reconstruct heads from dense studio captures (e.g., NeRSemble, RenderMe-360). They achieve high multi-view consistency but require expensive rigs and slow per-subject optimization.
Intermediate view synthesis methods first generate side views via image/video diffusion, then reconstruct a 3D Gaussian Splatting (3DGS) representation. Fidelity is high, but generation is slow and consistency is bottlenecked by the view generator.
Feed-forward 3D generators directly produce 3D Gaussians from a latent code, avoiding per-subject optimization. They are fast and scalable, but because the model never observes real multi-view pairs, enforcing multi-view consistency without explicit multi-view supervision remains an open challenge in this paradigm.

Figure 2. Paradigms for 3D Gaussian head avatar generation. (a) Optimization-based methods require expensive studio captures. (b) Intermediate view synthesis methods generate side views before reconstruction. (c) Our paradigm: feed-forward generators (including MVCHead) learn directly from 2D images, with no intermediate generation or 3D data.

Can a model trained without multi-view data match studio-grade consistency?

We measure cross-view consistency using MEt3R across the three paradigms:

Studio capture: the gold standard (high MVC).
Intermediate view synthesis: diffusion-generated side views accumulate inconsistency.
MVCHead self-renders: approach studio-grade consistency without any 3D supervision.

The key insight: self-renders from any fixed 3D Gaussian model are geometrically consistent by construction. This lets our SE(3) Multi-view Critic learn to distinguish plausible 3D configurations from inconsistent ones.


Figure 3. Self-renders provide a strong MVC prior. Per-pixel consistency maps (dark = consistent, bright = inconsistent) computed via MASt3R correspondences and FeatUp-DINO features. (a) NeRSemble (studio capture): MEt3R = 0.207. (b) CAP4D (intermediate view synthesis): MEt3R = 0.312. (c) MVCHead (Ours): MEt3R = 0.231, approaching studio-grade consistency without any 3D supervision.

Key Contributions

What We Contribute

1. MVCHead Architecture

The first state space model for 3D Gaussian heads. A single-shot, real-time, end-to-end differentiable pipeline that directly predicts 240K Gaussians from a latent code via stacked HiSS blocks. HiBiSS aligns recurrence with the principal axes of multi-view drift, reconciling typical view-to-view inconsistencies.

2. SE(3) Multi-view Critic

A learned reward that judges whether a set of self-renders arises from a single underlying 3D configuration. It rewards cross-view pixel alignment without ever observing real multi-view pairs, inducing MVC by design.

3. FaceGS-10K Dataset

First large-scale dataset of ready-to-use 3D Gaussian head assets, independent of any parametric 3D head model. Each asset contains 240K anisotropic Gaussians and 24 renderings at 512×512, enabling training, benchmarking, and evaluation of 3D-aware head models.

Method

Pipeline Overview

MVCHead takes a latent code and produces a complete set of 3D Gaussians in a single forward pass. A stack of HiSS blocks progressively refines Gaussians from coarse to fine, while HiBiSS scans propagate geometric and appearance cues along the principal axes of multi-view drift. The resulting Gaussians are rasterized and evaluated jointly by an adversarial texture discriminator and our SE(3) Multi-view Critic, inducing multi-view consistency by design, without any 3D supervision.

Figure 4. Model Architecture. MVCHead with its key components: HiSS blocks that hierarchically regress 3D Gaussian parameters (each Gaussian S_l becomes the anchor A_l for the next level S_{l+1}), the Hierarchical Bi-directional State Scan (HiBiSS) operating in all four directions, and the SE(3) Multi-view Critic that enforces MVC.

Stage 1

Hierarchical HiSS Blocks

L stacked HiSS blocks progressively refine Gaussians via anchor-based offsets, growing the Gaussian count per block until a 240K Gaussian budget is reached.
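The coarse-to-fine growth described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the root count (3750), the branching factor (4), the number of levels (3), the halving offset scale, and the choice to count only the finest level toward the 240K budget are all assumptions picked so the arithmetic lands on 240K. Random offsets stand in for the offsets a HiSS block would predict.

```python
import random

def refine(anchors, children_per_anchor, offset_scale, rng):
    """Spawn children around each anchor: child = anchor position + small offset.

    In the real model the offsets would be regressed by a HiSS block;
    here they are random placeholders (assumption).
    """
    children = []
    for (x, y, z) in anchors:
        for _ in range(children_per_anchor):
            dx = rng.uniform(-offset_scale, offset_scale)
            dy = rng.uniform(-offset_scale, offset_scale)
            dz = rng.uniform(-offset_scale, offset_scale)
            children.append((x + dx, y + dy, z + dz))
    return children

def coarse_to_fine(num_root, children_per_anchor, num_levels, seed=0):
    """Return the finest-level positions and the Gaussian count at every level."""
    rng = random.Random(seed)
    level = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(num_root)]
    counts = [len(level)]
    scale = 0.5  # hypothetical initial offset range
    for _ in range(num_levels):
        # Each Gaussian of the current level acts as the anchor of the next one.
        level = refine(level, children_per_anchor, scale, rng)
        scale *= 0.5  # finer levels move less (assumption)
        counts.append(len(level))
    return level, counts

if __name__ == "__main__":
    _, counts = coarse_to_fine(num_root=3750, children_per_anchor=4, num_levels=3)
    print(counts)
```

With these illustrative settings the per-level counts are 3750, 15000, 60000, and 240000, so the final level matches the 240K budget stated above.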

Stage 2

HiBiSS Scanning

Within each block, four directional scans (← → ↑ ↓) align state-space recurrence with the axes where multi-view inconsistency is strongest, enabling anisotropic, pose-aware smoothing.
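To make the idea of four directional scans concrete, the sketch below runs a simple linear recurrence h_t = decay * h_{t-1} + x_t over a 2D feature grid in each direction and averages the results. Several things here are assumptions for illustration only: that features live on a 2D grid, that a fixed scalar decay stands in for Mamba's input-dependent state-space parameters, and that the four outputs are merged by a plain average.

```python
def scan_1d(seq, decay):
    """Causal linear recurrence: h_t = decay * h_{t-1} + x_t."""
    out, h = [], 0.0
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out

def four_direction_scan(grid, decay=0.5):
    """Scan a 2D grid left-to-right, right-to-left, top-to-bottom, bottom-to-top.

    Returns the average of the four scans (the merge rule is an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    left_right = [scan_1d(row, decay) for row in grid]
    right_left = [scan_1d(row[::-1], decay)[::-1] for row in grid]
    columns = [[grid[i][j] for i in range(rows)] for j in range(cols)]
    top_bottom = [scan_1d(col, decay) for col in columns]              # indexed [j][i]
    bottom_top = [scan_1d(col[::-1], decay)[::-1] for col in columns]  # indexed [j][i]
    return [
        [
            (left_right[i][j] + right_left[i][j]
             + top_bottom[j][i] + bottom_top[j][i]) / 4.0
            for j in range(cols)
        ]
        for i in range(rows)
    ]

if __name__ == "__main__":
    impulse = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0]]
    for row in four_direction_scan(impulse):
        print(row)
```

A single unidirectional scan would carry the centre impulse only toward later positions in the scan order. With four directions, all four neighbours of the centre receive the same contribution, which is the property that lets information propagate along both axes.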

Stage 3

SE(3) Multi-view Critic

A GTA-augmented ViT evaluates whether self-renders arise from a single 3D configuration, providing a differentiable MVC reward trained as a binary set classifier against latent-mismatched negatives.
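One way to picture training a binary set classifier against latent-mismatched negatives is shown below. It is a sketch under stated assumptions: `render` is a hypothetical stand-in for rasterizing the Gaussians of one latent at one camera pose, a negative set is built by swapping a single view for a render of a different latent (the source does not specify how many views are swapped), and the critic is left abstract, with only the binary cross-entropy objective shown.

```python
import math
import random

def render(latent_id, pose):
    """Hypothetical stand-in for a self-render: records which latent and pose produced it."""
    return (latent_id, pose)

def make_sets(latent_ids, poses, rng):
    """Build (views, label) pairs.

    label 1: all views rendered from one latent (a single 3D configuration).
    label 0: one view replaced by a render of a different latent (assumption).
    """
    examples = []
    for lid in latent_ids:
        positive = [render(lid, p) for p in poses]
        examples.append((positive, 1))
        other = rng.choice([x for x in latent_ids if x != lid])
        swap = rng.randrange(len(poses))
        negative = list(positive)
        negative[swap] = render(other, poses[swap])
        examples.append((negative, 0))
    return examples

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

if __name__ == "__main__":
    rng = random.Random(0)
    for views, label in make_sets([0, 1, 2], [0, 90, 180, 270], rng):
        print(label, views)
```

Both kinds of set come from the generator's own renders, so no real multi-view pairs are needed to produce the labels.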

Experiments and Results

Comparison to Prior Art

Following prior work, we train MVCHead independently on the FFHQ and FFHQ-C datasets and benchmark against state-of-the-art generative 3D head models. We report perceptual realism (FID, FID_3D) and three axes of multi-view consistency (shape, texture, and geometric), adapting metrics from MVGBench and MEt3R to provide the first comprehensive quantitative MVC assessment for 3D head avatars.

Perceptual Realism (FID)

Comparison of FID scores at 512×512 resolution. MVCHead achieves the lowest FID on both benchmarks despite operating in the minimal-resource setting. ^†Uses super-resolution network. ^*Results reported from the original paper.

Method	Venue	FID (FFHQ) ↓	FID (FFHQ-C) ↓
StyleSDF	CVPR 2022	11.2^†	—
EpiGRAF	NeurIPS 2022	9.92	—
VoxGRAF	NeurIPS 2022	9.0	—
GMPI	ECCV 2022	8.29	—
StyleNeRF	ICLR 2022	7.80^†	—
EG3D	CVPR 2022	4.70^†	—
Mimic3D	ICCV 2023	5.37	—
GSM	CVPR 2024	28.19	—
GSGAN	NeurIPS 2024	5.60	5.17
GGHead	SIGGRAPH Asia 2024	5.15^*	5.37
CGSGAN	NeurIPS 2025	4.94	4.53
MVCHead	Ours	4.39	3.94

Perceptual Realism at Extremes (FID_3D)

FID_3D scores at 512×512, with camera poses randomly sampled across a wider range of viewpoints to probe realism at arbitrary angles.

Method	Venue	FID_3D (FFHQ) ↓	FID_3D (FFHQ-C) ↓
GSGAN	NeurIPS 2024	10.50	7.68
GGHead	SIGGRAPH Asia 2024	7.90	7.78
CGSGAN	NeurIPS 2025	4.94	4.53
MVCHead	Ours	4.39	3.94

Multi-view Consistency

Consistency scores across shape, texture, and geometric axes, averaged over 100 avatars. MVCHead outperforms CGSGAN on five of six metrics; depth error is comparable between the two methods (6.6649 vs. 6.6624).

Method	Shape: CD ↓	Shape: depth ↓	Texture: cPSNR ↑	Texture: cSSIM ↑	Texture: cLPIPS ↓	Geometric: MEt3R ↓
CGSGAN	0.6724	6.6624	21.852	0.7434	0.0622	0.2814
MVCHead	0.6654	6.6649	22.082	0.7636	0.0528	0.2620
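The "five of six" tally can be checked directly from the table. The snippet below is a small sanity check and not part of the authors' evaluation code; the values are copied from the rows above, and the direction of each metric follows the arrows in the header.

```python
# Values copied from the multi-view consistency table above.
scores = {
    "CGSGAN":  {"CD": 0.6724, "depth": 6.6624, "cPSNR": 21.852,
                "cSSIM": 0.7434, "cLPIPS": 0.0622, "MEt3R": 0.2814},
    "MVCHead": {"CD": 0.6654, "depth": 6.6649, "cPSNR": 22.082,
                "cSSIM": 0.7636, "cLPIPS": 0.0528, "MEt3R": 0.2620},
}
# Metrics marked with an up arrow; all others are lower-is-better.
higher_is_better = {"cPSNR", "cSSIM"}

def metrics_won(ours, baseline):
    """List the metrics on which `ours` is strictly better than `baseline`."""
    won = []
    for name, value in scores[ours].items():
        other = scores[baseline][name]
        better = value > other if name in higher_is_better else value < other
        if better:
            won.append(name)
    return won

def relative_gap(metric, ours, baseline):
    """Absolute difference on one metric, relative to the baseline value."""
    a, b = scores[ours][metric], scores[baseline][metric]
    return abs(a - b) / b

if __name__ == "__main__":
    print(metrics_won("MVCHead", "CGSGAN"))
    print(relative_gap("depth", "MVCHead", "CGSGAN"))
```

MVCHead leads on every metric except depth, where the relative gap is well below 0.1%.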
