¹Carnegie Mellon University  ²Shanghai AI Laboratory  ³Stanford University
We present a unified and generalizable framework that synthesizes view-consistent and temporally coherent avatars from a single image. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), which leads to multi-view and temporal inconsistencies due to the mismatch between these signals and the true appearance of the subject. Our approach bridges this gap by combining regression-based 3D human reconstruction with the generative power of a video diffusion model. First, a generalizable NeRF reconstructs an initial 3D human that provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. The geometry and appearance derived from this NeRF then serve as input to a video diffusion model. This integration is pivotal for enforcing both multi-view and temporal consistency throughout avatar generation. Empirical results underscore the strong generalization of our method, demonstrating its effectiveness across diverse in-domain and out-of-domain in-the-wild datasets.
Starting from a single input image, GAS uses a generalizable human NeRF to map the subject into a canonical space, then reposes and renders the 3D NeRF model to extract detailed appearance cues (i.e., NeRF renderings). These are paired with geometry cues (i.e., SMPL normal maps) and fed into a video diffusion model. A switcher module disentangles the tasks, enabling the model to generate either multi-view consistent novel views or temporally coherent pose animations.
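To make the data flow above concrete, here is a minimal PyTorch-style sketch under the assumption that the pipeline can be decomposed into three injected components: a generalizable human NeRF, an SMPL normal-map rasterizer, and a video diffusion model with a task switcher. All class and method names (GASPipeline, encode, render, sample, ...) are hypothetical placeholders, not the authors' actual API.

```python
import torch


class GASPipeline(torch.nn.Module):
    def __init__(self, nerf, normal_renderer, diffusion):
        super().__init__()
        self.nerf = nerf                        # generalizable human NeRF (canonical space)
        self.normal_renderer = normal_renderer  # SMPL normal-map rasterizer
        self.diffusion = diffusion              # video diffusion model with a task switcher

    @torch.no_grad()
    def synthesize(self, image, smpl_poses, cameras, mode):
        """mode: 'view' for novel-view synthesis, 'pose' for pose animation."""
        # 1. Lift the single input image into a canonical-space 3D representation.
        canonical = self.nerf.encode(image)

        # 2. Repose the canonical NeRF and render appearance cues
        #    (coarse NeRF renderings) for every target pose/camera pair.
        appearance_cues = torch.stack([
            self.nerf.render(canonical, pose, cam)
            for pose, cam in zip(smpl_poses, cameras)
        ])

        # 3. Render the matching geometry cues (SMPL normal maps).
        geometry_cues = torch.stack([
            self.normal_renderer(pose, cam)
            for pose, cam in zip(smpl_poses, cameras)
        ])

        # 4. The switcher selects the view- or pose-synthesis branch, and the
        #    video diffusion model turns the paired cues into the final frames.
        return self.diffusion.sample(
            reference=image,
            appearance=appearance_cues,
            geometry=geometry_cues,
            task=mode,
        )
```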
Leveraging the unified framework, we enable interactive avatar synthesis, allowing users to render novel views while a novel pose animation is in progress.
By alternating sampling between view and pose synthesis, we can generate synchronized multi-view videos of human performers from only a single image.
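The snippet below is a hedged sketch of this alternating sampling, reusing the hypothetical GASPipeline from the previous example: pose-synthesis steps advance the motion in a driving view, and view-synthesis steps propagate each animated frame to the remaining cameras, yielding a synchronized multi-view video from one reference image. The authors' exact scheduling may differ.

```python
def synthesize_multiview_video(pipeline, image, pose_sequence, cameras):
    multiview_video = []  # multiview_video[t][v] = frame at time t, view v
    for pose in pose_sequence:
        # Pose step: animate the subject to the next pose in the driving view.
        driving_frame = pipeline.synthesize(
            image, [pose], [cameras[0]], mode="pose"
        )[0]
        # View step: propagate the animated frame to the other viewpoints.
        other_views = pipeline.synthesize(
            driving_frame, [pose] * (len(cameras) - 1), cameras[1:], mode="view"
        )
        multiview_video.append([driving_frame, *other_views])
    return multiview_video
```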
We demonstrate the capability of our method to synthesize view-consistent avatars from a single image.
We demonstrate the capability of our method to synthesize temporally coherent avatars with realistic deformations from a single image.
We compare against baselines on the tasks of novel view synthesis and novel pose animation.
@article{lu2025gas,
  title={GAS: Generative Avatar Synthesis from a Single Image},
  author={Lu, Yixing and Dong, Junting and Kwon, Youngjoong and Zhao, Qin and Dai, Bo and De la Torre, Fernando},
  journal={arXiv preprint arXiv:2502.06957},
  year={2025}
}