CVPR 2026

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

3D Head Avatars from just 70K Random Internet Images! No 3D, no multi-view, no studio, no view synthesis at any stage of training or inference.

Aviral Chharia¹, Fernando De la Torre¹
¹Carnegie Mellon University
CMU Robotics Institute Human Sensing Lab
Paper Code Poster Suppl (Coming Soon) Dataset (Coming Soon)
TL;DR: MVCHead generates high-fidelity, multi-view consistent 3D Gaussian head avatars from random 2D images alone, in a single real-time forward pass, without any 3D supervision, multi-view captures, studio rigs, or intermediate view synthesis.
Figure 1. The generated Gaussian heads by MVCHead capture complex textures and fine facial micro-structure, including wrinkles, hair wisps, ear rims, lip contours, skin blemishes, eyes, and accessories.
  • 3.94 FID on FFHQ-C (SOTA perceptual realism)
  • Real-time, single-shot inference
  • 240K Gaussians per avatar (fine facial micro-structure)
  • 10K 3D head assets (FaceGS-10K dataset)

Teaser Results

Although conditional generation is not the primary focus of this work, MVCHead is sufficiently structured to support personalized avatar synthesis while preserving multi-view consistency.

Learning 3D Heads Without 3D Data

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models.

How does MVCHead differ from broader 3D head generation paradigms?

Prior work on 3D Gaussian head avatars falls into three categories, distinguished by their data and supervision requirements:

  • Optimization-based methods reconstruct heads from dense studio captures (e.g., NeRSemble, RenderMe-360). High multi-view consistency, but expensive rigs and slow per-subject optimization.
  • Intermediate view synthesis methods first generate side views via image/video diffusion, then reconstruct a 3DGS. High fidelity, but generation is slow and consistency is bottlenecked by the view generator.
  • Feed-forward 3D generators directly produce 3D Gaussians from a latent code, avoiding per-subject optimization. Fast and scalable, but enforcing multi-view consistency without explicit multi-view supervision remains an open challenge in this paradigm, where the model never observes real multi-view pairs.
Figure 2. Paradigms for 3D Gaussian head avatar generation. (a) Optimization-based methods require expensive studio captures. (b) Intermediate view synthesis methods generate side views before reconstruction. (c) Our paradigm: feed-forward generators (including MVCHead) that learn directly from 2D images, with no intermediate generation or 3D data.

Can a model trained without multi-view data match studio-grade consistency?

We measure cross-view consistency using MEt3R across the three paradigms:

  • Studio capture: the gold standard (high MVC).
  • Intermediate view synthesis: diffusion-generated side views accumulate inconsistency.
  • MVCHead self-renders: approach studio-grade consistency without any 3D supervision.

The key insight: self-renders from any fixed 3D Gaussian model are geometrically consistent by construction. This lets our SE(3) Multi-view Critic learn to distinguish plausible 3D configurations from inconsistent ones.
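The "consistent by construction" claim can be illustrated with a toy reprojection check. The sketch below is not the MVCHead renderer: the orbiting pinhole camera, the focal length, and the `drift` parameter are all illustrative assumptions. It shows that when two views are projected from the same fixed 3D points, warping one view into the other via depth lands exactly on the pixels the second view shows, whereas a per-view change in geometry (as an intermediate view generator might introduce) produces a nonzero reprojection error.

```python
import math

def rotate_yaw(p, yaw):
    """Rotate a 3D point about the vertical (y) axis by `yaw` radians."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)

def project(p_world, yaw, focal=500.0, dist=3.0):
    """Toy pinhole camera orbiting the origin; returns (u, v, depth)."""
    x, y, z = rotate_yaw(p_world, -yaw)   # world frame -> camera-aligned frame
    depth = z + dist                      # camera sits `dist` units back
    return (focal * x / depth, focal * y / depth, depth)

def unproject(u, v, depth, yaw, focal=500.0, dist=3.0):
    """Invert `project`: lift a pixel with known depth back to world space."""
    p_cam = (u * depth / focal, v * depth / focal, depth - dist)
    return rotate_yaw(p_cam, yaw)

def max_reprojection_error(points, yaw_a, yaw_b, drift=0.0):
    """Warp view A into view B via depth and compare with what view B shows.

    `drift` (hypothetical) shifts the geometry that view B is rendered from,
    imitating a generator that hallucinates a slightly different head per view.
    """
    worst = 0.0
    for p in points:
        ua, va, da = project(p, yaw_a)
        warped = project(unproject(ua, va, da, yaw_a), yaw_b)
        shown = project((p[0] + drift, p[1], p[2]), yaw_b)
        worst = max(worst, math.hypot(warped[0] - shown[0], warped[1] - shown[1]))
    return worst

points = [(0.3, 0.1, 0.2), (-0.4, 0.5, -0.1), (0.0, -0.6, 0.4)]
consistent = max_reprojection_error(points, yaw_a=0.0, yaw_b=0.6)
drifting = max_reprojection_error(points, yaw_a=0.0, yaw_b=0.6, drift=0.05)
print("shared 3D model:", consistent, "| per-view drift:", drifting)
```

With a shared 3D model the error is zero up to floating-point rounding, which is why self-renders can serve as consistent examples for the critic without any real multi-view captures.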

Figure 3. Self-renders provide a strong MVC prior. Per-pixel consistency maps (dark = consistent, bright = inconsistent) computed via MASt3R correspondences and FeatUp-DINO features. (a) NeRSemble (studio capture): MEt3R = 0.207. (b) CAP4D (intermediate view synthesis): MEt3R = 0.312. (c) MVCHead (Ours): MEt3R = 0.231, approaching studio-grade consistency without any 3D supervision.

What We Contribute

1. MVCHead Architecture

The first state space model for 3D Gaussian heads. A single-shot, real-time, end-to-end differentiable pipeline that directly predicts 240K Gaussians from a latent code via stacked HiSS blocks; HiBiSS aligns recurrence with the principal axes of multi-view drift, reconciling typical view-to-view inconsistencies.

2. SE(3) Multi-view Critic

A learned reward that judges whether a set of self-renders arises from a single underlying 3D configuration. It rewards cross-view pixel alignment without ever observing real multi-view pairs, inducing MVC by design.

3. FaceGS-10K Dataset

First large-scale dataset of ready-to-use 3D Gaussian head assets, independent of any parametric 3D head model. Each asset contains 240K anisotropic Gaussians and 24 renderings at 512×512, enabling training, benchmarking, and evaluation of 3D-aware head models.

Pipeline Overview

MVCHead takes a latent code and produces a complete set of 3D Gaussians in a single forward pass. A stack of HiSS blocks progressively refines Gaussians from coarse to fine, while HiBiSS scans propagate geometric and appearance cues along the principal axes of multi-view drift. The resulting Gaussians are rasterized and evaluated jointly by an adversarial texture discriminator and our SE(3) Multi-view Critic, inducing multi-view consistency by design, without any 3D supervision.

Figure 4. Model Architecture. MVCHead with its key components: HiSS blocks that hierarchically regress 3D Gaussian parameters (the Gaussians S_l at level l become the anchors A_l for the next level S_{l+1}), the Hierarchical Bi-directional State Scan (HiBiSS) operating in all four directions, and the SE(3) Multi-view Critic that enforces MVC.
Stage 1

Hierarchical HiSS Blocks

L stacked HiSS blocks progressively refine Gaussians via anchor-based offsets, growing the Gaussian count per block until a 240K Gaussian budget is reached.
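A minimal sketch of anchor-based coarse-to-fine growth, under stated assumptions: the base count of 3,750 anchors, the branching factor of 4, the three levels, and the random offsets are all hypothetical choices made only so that the final count equals 240K. In MVCHead the offsets are regressed by the HiSS blocks, not sampled, and the actual per-level counts are not given here.

```python
import random

def refine_level(anchors, branching, offset_scale, rng):
    """Spawn `branching` children per anchor, each at anchor + small offset.

    The Gaussian offsets are a random stand-in for network-predicted offsets.
    """
    children = []
    for ax, ay, az in anchors:
        for _ in range(branching):
            children.append((ax + rng.gauss(0.0, offset_scale),
                             ay + rng.gauss(0.0, offset_scale),
                             az + rng.gauss(0.0, offset_scale)))
    return children

def build_hierarchy(num_base, branching, num_levels, seed=0):
    """Return the finest-level positions and the point count at every level."""
    rng = random.Random(seed)
    level = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
             for _ in range(num_base)]
    counts = [len(level)]
    scale = 0.1
    for _ in range(num_levels):
        level = refine_level(level, branching, scale, rng)  # previous level acts as anchors
        counts.append(len(level))
        scale *= 0.5                                        # finer levels move less
    return level, counts

# Hypothetical schedule: 3,750 anchors, x4 per level, 3 levels.
finest, counts = build_hierarchy(num_base=3750, branching=4, num_levels=3)
print(counts)
```

Shrinking the offset scale at each level keeps children close to their anchors, which is one simple way to express "coarse to fine"; the actual schedule used by the model may differ.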

Stage 2

HiBiSS Scanning

Within each block, four directional scans (← → ↑ ↓) align state-space recurrence with the axes where multi-view inconsistency is strongest, enabling anisotropic, pose-aware smoothing.
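The idea of scanning in four directions can be sketched with a scalar linear recurrence on a 2D grid. This is a deliberate simplification and not HiBiSS itself: it uses a fixed `decay` instead of Mamba's input-dependent state-space parameters, omits the hierarchy, and assumes the features are laid out on a regular grid. It only illustrates why a single left-to-right scan leaves some positions uninformed, while combining four directions propagates information to both sides of every row and column.

```python
def scan_1d(seq, decay=0.8):
    """Linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h, out = 0.0, []
    for x in seq:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out

def four_direction_scan(grid, decay=0.8):
    """Average of left-to-right, right-to-left, top-down and bottom-up scans."""
    rows, cols = len(grid), len(grid[0])
    left_right = [scan_1d(r, decay) for r in grid]
    right_left = [scan_1d(r[::-1], decay)[::-1] for r in grid]
    columns = [[grid[i][j] for i in range(rows)] for j in range(cols)]
    top_down = [scan_1d(c, decay) for c in columns]           # indexed [col][row]
    bottom_up = [scan_1d(c[::-1], decay)[::-1] for c in columns]
    return [[(left_right[i][j] + right_left[i][j]
              + top_down[j][i] + bottom_up[j][i]) / 4.0
             for j in range(cols)] for i in range(rows)]

# A single impulse in the middle of a 5x5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0

one_way = scan_1d(grid[2])           # left-to-right only
mixed = four_direction_scan(grid)    # all four directions
print("left of impulse, one-way scan:", one_way[0])
print("left of impulse, four-way scan:", mixed[2][0])
```

In the one-way scan, positions before the impulse receive nothing, since the recurrence has not yet seen it; the four-way combination reaches them through the reverse scan.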

Stage 3

SE(3) Multi-view Critic

A GTA-augmented ViT evaluates whether self-renders arise from a single 3D configuration, providing a differentiable MVC reward trained as a binary set classifier against latent-mismatched negatives.
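One way to picture "a binary set classifier against latent-mismatched negatives" is through how the training sets could be assembled. The sketch below is an assumption-laden illustration: `fake_render` is a hypothetical stand-in for rasterizing a generated head, and building a negative by swapping a single view for a render of a different latent is only one plausible construction; the exact recipe used for MVCHead is not specified here. The binary cross-entropy helper is the standard loss for such a classifier.

```python
import math
import random

def fake_render(latent_id, view_index):
    """Hypothetical stand-in for rendering latent `latent_id` at camera `view_index`."""
    return (latent_id, view_index)

def make_critic_batch(latent_ids, num_views, rng):
    """Build (view_set, label) pairs.

    label 1: every view comes from the same latent (one underlying 3D head).
    label 0: one view is replaced by a render of a different latent.
    """
    batch = []
    for z in latent_ids:
        views = [fake_render(z, v) for v in range(num_views)]
        batch.append((views, 1))
        other = rng.choice([w for w in latent_ids if w != z])
        swap = rng.randrange(num_views)
        negative = list(views)
        negative[swap] = fake_render(other, swap)   # same camera, different head
        batch.append((negative, 0))
    return batch

def binary_cross_entropy(prob, label, eps=1e-7):
    """Standard BCE for a critic that outputs P(set is consistent)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

rng = random.Random(0)
batch = make_critic_batch(latent_ids=[0, 1, 2, 3], num_views=4, rng=rng)
for views, label in batch[:2]:
    print(label, views)
```

Note that no real multi-view capture appears anywhere in this construction: both positives and negatives are built purely from the generator's own renders.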

Experiments and Results

We train MVCHead independently on FFHQ and FFHQ-C following the established experimental protocol, and benchmark against state-of-the-art generative 3D head models. We report perceptual realism (FID, FID3D) and three axes of multi-view consistency: shape, texture, and geometric, adapting metrics from MVGBench and MEt3R to provide the first comprehensive quantitative MVC assessment for 3D head avatars.

Perceptual Realism (FID)

Comparison of FID scores at 512×512 resolution. MVCHead achieves state-of-the-art on both benchmarks despite operating in the minimal-resource setting. Uses super-resolution network. *Results reported from the original paper.

Method     Venue                FID (FFHQ) ↓   FID (FFHQ-C) ↓
StyleSDF   CVPR 2022            11.2           -
EpiGRAF    NeurIPS 2022         9.92           -
VoxGRAF    NeurIPS 2022         9.0            -
GMPI       ECCV 2022            8.29           -
StyleNeRF  ICLR 2022            7.80           -
EG3D       CVPR 2022            4.70           -
Mimic3D    ICCV 2023            5.37           -
GSM        CVPR 2024            28.19          -
GSGAN      NeurIPS 2024         5.60           5.17
GGHead     SIGGRAPH Asia 2024   5.15*          5.37
CGSGAN     NeurIPS 2025         4.94           4.53
MVCHead    Ours                 4.39           3.94

Perceptual Realism at Extremes (FID3D)

FID3D scores at 512×512, with camera poses randomly sampled across a wider range of viewpoints to probe realism at arbitrary angles.

Method     Venue                FID3D (FFHQ) ↓   FID3D (FFHQ-C) ↓
GSGAN      NeurIPS 2024         10.50            7.68
GGHead     SIGGRAPH Asia 2024   7.90             7.78
CGSGAN     NeurIPS 2025         4.94             4.53
MVCHead    Ours                 4.39             3.94

Multi-view Consistency

Consistency scores across shape, texture, and geometric axes, averaged over 100 avatars. MVCHead achieves state-of-the-art on five of six metrics; depth error is comparable between the two methods.

Method     Shape               Texture                         Geometric
           CD ↓      depth ↓   cPSNR ↑   cSSIM ↑   cLPIPS ↓    MEt3R ↓
CGSGAN     0.6724    6.6624    21.852    0.7434    0.0622      0.2814
MVCHead    0.6654    6.6649    22.082    0.7636    0.0528      0.2620

Ablation Study

Ablations on FFHQ-C at 512×512. Each proposed component contributes meaningfully to both realism (FID) and multi-view consistency (MEt3R). The SE(3) Multi-view Critic and HiBiSS are especially critical for cross-view consistency.

Configuration               FID ↓               MEt3R ↓
Full Model                  3.94                0.2620
Loss
  w/o ℒ_adv                 training collapse
  w/o ℒ_mvc (SE(3) Critic)  5.41                0.3144
Model Architecture
  w/o HiSS Block            5.28                0.2948
  w/o HiBiSS Scan           4.78                0.2873
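To make the size of each ablation effect easier to compare, the short script below computes the relative increase of each metric over the full model, using only the numbers reported in the table. The dictionary keys are just labels chosen for this example.

```python
# Values copied from the ablation table (FFHQ-C, 512x512); lower is better.
full = {"FID": 3.94, "MEt3R": 0.2620}
ablations = {
    "w/o L_mvc (SE(3) Critic)": {"FID": 5.41, "MEt3R": 0.3144},
    "w/o HiSS Block":           {"FID": 5.28, "MEt3R": 0.2948},
    "w/o HiBiSS Scan":          {"FID": 4.78, "MEt3R": 0.2873},
}

def relative_increase(ablated, reference):
    """Percentage by which `ablated` exceeds `reference`."""
    return 100.0 * (ablated - reference) / reference

degradation = {
    name: {metric: relative_increase(vals[metric], full[metric]) for metric in full}
    for name, vals in ablations.items()
}

for name, d in degradation.items():
    print(f"{name:26s} FID +{d['FID']:.1f}%   MEt3R +{d['MEt3R']:.1f}%")
```

Removing the critic gives the largest relative degradation on both metrics (roughly 37% in FID and 20% in MEt3R), followed by removing the HiSS block and then the HiBiSS scan.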

FaceGS-10K

FaceGS-10K is the first large-scale dataset of ready-to-use 3D Gaussian head assets, independent of any parametric 3D head model. Each asset contains 240K anisotropic Gaussians and 24 renderings at 512×512, enabling training, benchmarking, and evaluation of 3D-aware head models.
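For readers unfamiliar with what "anisotropic" means for a single Gaussian, the sketch below builds a 3×3 covariance from per-axis scales and a rotation quaternion, using the parameterization common in 3D Gaussian Splatting (Σ = R S Sᵀ Rᵀ). Whether FaceGS-10K stores its Gaussians with exactly these fields is an assumption; the function names here are illustrative and not part of any released loader.

```python
import math

def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n       # normalize defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]

def covariance(scales, quat):
    """Sigma = R * diag(scales)^2 * R^T for one anisotropic Gaussian."""
    R = quat_to_rotation(quat)
    return [[sum(R[i][k] * scales[k] ** 2 * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]

# Axis-aligned Gaussian: the covariance is diagonal with squared scales.
axis_aligned = covariance((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0))

# Same scales, rotated 90 degrees about z: the x and y extents swap.
half = math.pi / 4
rotated = covariance((1.0, 2.0, 3.0), (math.cos(half), 0.0, 0.0, math.sin(half)))

print(axis_aligned)
print(rotated)
```

Unequal scales are what make a Gaussian anisotropic: it can stretch along a hair strand or flatten against the skin surface instead of remaining a sphere.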

BibTeX

@InProceedings{Chharia_2026_CVPR,
    author    = {Chharia, Aviral and De la Torre, Fernando},
    title     = {Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {40163-40174}
}