GAS: Generative Avatar Synthesis from a Single Image


ICCV 2025

Carnegie Mellon University     Shanghai AI Laboratory     Stanford University

TL;DR: We introduce a unified framework for generative avatar synthesis from a single image, featuring consistent view synthesis and realistic pose animation.

Abstract

We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), which leads to multi-view and temporal inconsistencies because these signals do not match the true appearance of the subject. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. First, a generalizable NeRF reconstructs an initial 3D human from the input image, providing dense conditioning that keeps the synthesis faithful to the reference appearance and structure. The geometry and appearance derived from this NeRF then serve as input to a video-based diffusion model. This integration is pivotal for enforcing both multi-view and temporal consistency throughout avatar generation. Empirical results underscore the strong generalization of our method, demonstrating its effectiveness on diverse in-domain datasets and out-of-domain in-the-wild data.

Method

Starting from a single input image, GAS uses a generalizable human NeRF to map the subject into a canonical space, then reposes and renders the 3D NeRF model to extract detailed appearance cues (i.e., NeRF renderings). These are paired with geometry cues (i.e., SMPL normal maps) and fed into a video diffusion model. A switcher module disentangles the tasks, enabling the model to generate either multi-view consistent novel views or temporally coherent pose animations.

Figure: Overview of the GAS pipeline.
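To make the data flow concrete, below is a minimal Python sketch of this pipeline. All function names (`canonicalize`, `repose_and_render`, `smpl_normal_map`, `video_diffusion`) are hypothetical stand-ins for the components described above, not the authors' actual API; the bodies are placeholders so the sketch runs.

```python
import numpy as np

def canonicalize(image):
    """Hypothetical: fit the generalizable human NeRF to the input
    image and map the subject into canonical space."""
    return {"nerf": image}  # placeholder canonical representation

def repose_and_render(canonical, pose, camera):
    """Hypothetical: repose the canonical NeRF and render the
    appearance cue (a coarse RGB rendering) for this pose/camera."""
    return np.zeros((512, 512, 3), dtype=np.float32)

def smpl_normal_map(pose, camera):
    """Hypothetical: rasterize SMPL mesh normals as the geometry cue."""
    return np.zeros((512, 512, 3), dtype=np.float32)

def video_diffusion(reference, appearance, geometry, mode):
    """Hypothetical video diffusion backbone; `mode` is the switcher
    flag selecting 'view' or 'pose' generation."""
    assert mode in ("view", "pose")
    return np.stack(appearance)  # placeholder output frames

def synthesize(image, poses, cameras, mode="pose"):
    """Condition the diffusion model on NeRF renderings (appearance)
    and SMPL normal maps (geometry) for each target pose/camera."""
    canonical = canonicalize(image)
    appearance = [repose_and_render(canonical, p, c) for p, c in zip(poses, cameras)]
    geometry = [smpl_normal_map(p, c) for p, c in zip(poses, cameras)]
    return video_diffusion(image, appearance, geometry, mode)
```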

Applications

Interactive view and pose synthesis

Leveraging the unified framework, we enable interactive avatar synthesis, allowing users to change the viewpoint while animating novel poses.
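As a sketch of how such an interactive loop might be driven, the snippet below reuses the hypothetical `synthesize` entry point from the pipeline sketch above (passed in as an argument so the function is self-contained). The rule for setting the switcher from camera motion is our assumption, not the authors' stated mechanism.

```python
def interactive_loop(image, pose_sequence, get_user_camera, synthesize):
    """Per animation step, read the user's camera and render one frame;
    view and pose synthesis interleave as the camera moves."""
    frames, prev_cam = [], None
    for pose in pose_sequence:
        cam = get_user_camera()  # cameras assumed comparable, e.g. tuples
        # Assumption: switch to the view branch when the camera moves,
        # otherwise advance the animation with the pose branch.
        mode = "view" if cam != prev_cam else "pose"
        frames.append(synthesize(image, [pose], [cam], mode=mode)[0])
        prev_cam = cam
    return frames
```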

Synchronized Multi-view Video Generation

By alternating sampling between view and pose synthesis, we can generate synchronized multi-view videos of human performers from only a single image.
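One plausible reading of this alternation, sketched below under our own assumptions: keep a latent grid over views and time, and flip the switcher at each denoising step, so even steps enforce cross-view consistency and odd steps enforce temporal coherence. `denoise_step` is a hypothetical placeholder, not the released sampler.

```python
import numpy as np

def denoise_step(latents, mode, step):
    """Hypothetical single denoising step of the video diffusion
    model with the switcher set to `mode` ('view' or 'pose')."""
    return latents * 0.98  # placeholder update

def synchronized_multiview(n_views, n_frames, n_steps=50):
    # Latent grid over (views, time, H, W, C); each denoising step
    # alternates the switcher between the view and pose tasks.
    latents = np.random.randn(n_views, n_frames, 64, 64, 4).astype(np.float32)
    for step in range(n_steps):
        mode = "view" if step % 2 == 0 else "pose"
        latents = denoise_step(latents, mode, step)
    return latents  # decoded elsewhere into synchronized videos
```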

Results

Novel view synthesis

We demonstrate the capability of our method to synthesize view-consistent avatars from a single image.

Novel pose animation

We demonstrate the capability of our method to synthesize temporally coherent avatars with realistic deformations from a single image.

Comparison with baselines

We compare against baselines on the tasks of novel view synthesis and novel pose animation.

BibTeX

@article{lu2025gas,
  title={GAS: Generative Avatar Synthesis from a Single Image},
  author={Lu, Yixing and Dong, Junting and Kwon, Youngjoong and Zhao, Qin and Dai, Bo and De la Torre, Fernando},
  journal={arXiv preprint arXiv:2502.06957},
  year={2025}
}