GAS: Generative Avatar Synthesis from a Single Image

1Carnegie Mellon University     2Shanghai AI Laboratory     3Stanford University    

TL;DR: We introduce a unified framework for generative avatar synthesis from a single image, featuring consistent view synthesis and realistic pose animation.

Abstract

We introduce a generalizable and unified framework that synthesizes view-consistent and temporally coherent avatars from a single image. While recent methods employ diffusion models conditioned on human templates such as depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between the sparse driving signal and the actual human subject, resulting in multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. The dense driving signal from the initial reconstructed human provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Additionally, our unified framework enables the generalization learned from novel pose synthesis on in-the-wild videos to transfer naturally to novel view synthesis. Our video diffusion model performs disentangled synthesis, producing high-quality, view-consistent renderings for novel views and realistic non-rigid deformations for novel pose animation. Results demonstrate the superior generalization ability of our method across in-domain and out-of-domain in-the-wild datasets.

Method

From a single input human image, our approach leverages a generalizable human NeRF to map the subject to canonical space, followed by reposing and rendering to obtain the appearance cue (i.e., NeRF renderings). Paired with the geometry cue (i.e., SMPL normal maps), these renderings provide comprehensive conditioning for the video diffusion model, enabling multi-view-consistent novel view synthesis or temporally consistent pose animation, with the two tasks disentangled by a switcher module.

Figure: Pipeline overview.
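To make the conditioning flow concrete, the following is a minimal, hypothetical sketch of the pipeline described above. The module names (GeneralizableHumanNeRF, SMPLNormalRenderer, VideoDiffusionModel), tensor shapes, and the embedding-based switcher are illustrative placeholders under our assumptions, not the released implementation.

```python
# Hypothetical sketch: dense appearance cue + SMPL geometry cue conditioning a
# video diffusion model, with a switcher selecting view vs. pose synthesis.
import torch
import torch.nn as nn


class GeneralizableHumanNeRF(nn.Module):
    """Placeholder: maps a single image to canonical space and renders it under
    a target pose/camera (the dense appearance cue)."""
    def forward(self, image, target_pose, target_camera):
        return torch.rand(image.shape[0], 3, 256, 256)  # coarse NeRF rendering


class SMPLNormalRenderer(nn.Module):
    """Placeholder: rasterizes an SMPL normal map for the target pose/camera
    (the geometry cue)."""
    def forward(self, target_pose, target_camera):
        return torch.rand(target_pose.shape[0], 3, 256, 256)


class VideoDiffusionModel(nn.Module):
    """Placeholder video diffusion backbone; a learned 'switcher' embedding
    tells it whether frames vary in viewpoint or in body pose."""
    def __init__(self):
        super().__init__()
        self.switcher = nn.Embedding(2, 64)  # 0 = novel view, 1 = novel pose

    def forward(self, reference_image, appearance_cue, geometry_cue, mode):
        cond = torch.cat([appearance_cue, geometry_cue], dim=1)  # joint condition
        task_emb = self.switcher(mode)           # disentangles the two tasks
        _ = (reference_image, cond, task_emb)    # denoising U-Net would go here
        return torch.rand_like(appearance_cue)   # synthesized frame


def synthesize(image, target_poses, target_cameras, mode):
    """Render appearance and geometry cues per target frame, then condition
    the video diffusion model on them."""
    nerf, normals, diffusion = GeneralizableHumanNeRF(), SMPLNormalRenderer(), VideoDiffusionModel()
    frames = []
    for pose, cam in zip(target_poses, target_cameras):
        appearance = nerf(image, pose, cam)      # dense appearance cue
        geometry = normals(pose, cam)            # SMPL normal map
        frames.append(diffusion(image, appearance, geometry, mode))
    return torch.stack(frames, dim=1)            # (B, T, 3, H, W)


if __name__ == "__main__":
    img = torch.rand(1, 3, 256, 256)
    poses = [torch.rand(1, 72) for _ in range(4)]    # SMPL pose parameters
    cams = [torch.rand(1, 4, 4) for _ in range(4)]   # camera matrices
    out = synthesize(img, poses, cams, mode=torch.tensor([0]))  # novel-view mode
    print(out.shape)  # torch.Size([1, 4, 3, 256, 256])
```

In this reading, the same conditioned backbone serves both tasks: holding the pose fixed while sweeping the camera yields novel view synthesis, while holding the camera fixed and varying the pose yields animation, with the switcher embedding flagging which regime the frame sequence belongs to.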

Results (more coming soon)

Interactive view and pose synthesis

Leveraging the unified framework, we enable interactive synthesis of human avatars, allowing users to synthesize novel views during novel pose animation.

Novel view synthesis

We demonstrate the capability of our method to synthesize view-consistent avatars from a single image.

Novel pose animation

We demonstrate the capability of our method to synthesize temporally coherent avatars with realistic deformations from a single image.

Comparison with baselines

We compare against baselines on the tasks of novel view synthesis and novel pose animation.

Main Video

In this video, we present results for in-the-wild avatar synthesis, comparisons with baselines, ablation analysis, and additional applications.

Acknowledgements

We appreciate the helpful feedback on the manuscript from Jialu Gao and Jianjin Xu. We thank Wenbo Gou and Yuhang Yang for thoughtful research discussions.