FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

¹University of California, Merced  ²Adobe Research
🌺 ICCV 2025 🌺

FaceLift transforms a single facial image into a high-fidelity 3D Gaussian head representation. Trained exclusively on synthetic 3D data, our pipeline produces a complete and detailed 3D head representation that generalizes remarkably well to real-world human images.

Abstract

We present FaceLift, a novel feed-forward approach for generalizable, high-quality 360-degree head reconstruction from a single image. Our pipeline first employs a multi-view latent diffusion model to generate consistent side and back views from a single facial input, which then feed into a transformer-based reconstructor that produces a comprehensive 3D Gaussian Splats representation. Previous methods for monocular 3D face reconstruction often lack full view coverage or view consistency due to insufficient multi-view supervision. We address this by creating a high-quality synthetic head dataset that enables consistent supervision across viewpoints. To bridge the domain gap between synthetic training data and real-world images, we propose a simple yet effective technique that ensures the view-generation process maintains fidelity to the input by learning to reconstruct the input image alongside view generation. Despite being trained exclusively on synthetic data, our method demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art 3D face reconstruction methods in identity preservation, detail recovery, and rendering quality.

Method

Overview of FaceLift. Given a single image of a human face as input, we train an image-conditioned, multi-view diffusion model to generate novel views covering the entire head. By reconstructing the input image alongside view generation and leveraging high-quality synthetic data, our multi-view latent diffusion model can hallucinate unseen views of the human head with high fidelity and multi-view consistency. We then train a transformer-based reconstructor that takes the multi-view images and their camera poses as input and generates 3D Gaussian Splats representing the human head.
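The two-stage pipeline above can be sketched in code. This is a hypothetical, heavily mocked illustration: the function names, the number of generated views, and the camera layout are assumptions for exposition, not the released FaceLift API.

```python
# Hypothetical sketch of the FaceLift two-stage pipeline.
# All names, shapes, and the view count are illustrative assumptions.
import math
import random

NUM_VIEWS = 6    # assumed: input view plus generated side/back views
IMAGE_SIZE = 4   # tiny stand-in resolution for the sketch

def multi_view_diffusion(input_image):
    """Stage 1 (mocked): hallucinate consistent novel views of the head.
    The real model is an image-conditioned latent diffusion model that is
    also trained to reconstruct the input view, anchoring identity."""
    random.seed(0)
    views = [input_image]  # the input view is reconstructed alongside
    for _ in range(NUM_VIEWS - 1):
        views.append([[random.random() for _ in row] for row in input_image])
    return views

def camera_poses(num_views):
    """Assumed: cameras evenly spaced on a ring around the head (azimuth only)."""
    return [2.0 * math.pi * i / num_views for i in range(num_views)]

def gs_reconstructor(views, poses):
    """Stage 2 (mocked): transformer-based reconstructor (GS-LRM style)
    mapping posed multi-view images to a set of 3D Gaussian splats."""
    gaussians = []
    for view, azimuth in zip(views, poses):
        mean_intensity = sum(sum(row) for row in view) / (IMAGE_SIZE ** 2)
        gaussians.append({
            "position": (math.cos(azimuth), 0.0, math.sin(azimuth)),
            "opacity": mean_intensity,
        })
    return gaussians

def facelift(input_image):
    views = multi_view_diffusion(input_image)
    poses = camera_poses(len(views))
    return gs_reconstructor(views, poses)

face = [[0.5] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
splats = facelift(face)
print(len(splats))  # one mock Gaussian per view in this toy sketch
```

The key design point the sketch mirrors is the decoupling: view generation handles the ill-posed hallucination of unseen regions, while the reconstructor only has to fuse already-consistent posed views into 3D.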

Results

Single Image to 3D Head


FaceLift lifts a single facial image to a detailed 3D reconstruction with preserved identity features.

Video as Input for 4D Novel View Synthesis

Given a video as input, FaceLift processes each frame individually and generates a sequence of 3D Gaussian representations, enabling 4D novel view synthesis.
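Because FaceLift is purely per-frame, extending it to video is a simple loop: each frame is lifted independently, and the resulting list of 3D Gaussian sets forms the 4D sequence. A minimal sketch, where `reconstruct_frame` is a hypothetical stand-in for the full single-image pipeline:

```python
# Hypothetical per-frame 4D sketch; `reconstruct_frame` stands in for the
# full single-image FaceLift pipeline and is an illustrative assumption.
def reconstruct_frame(frame):
    # Stand-in: summarize the frame instead of producing real Gaussians.
    n = len(frame) * len(frame[0])
    return {"frame_mean": sum(sum(row) for row in frame) / n}

def video_to_4d(frames):
    # Independent per-frame lifting: one 3D Gaussian set per time step.
    return [reconstruct_frame(f) for f in frames]

video = [[[0.1 * t] * 2 for _ in range(2)] for t in range(3)]
sequence = video_to_4d(video)
print(len(sequence))  # one 3D representation per input frame
```

Since frames are processed independently, temporal consistency comes only from the consistency of the input video itself, not from any explicit temporal model.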

Input Video

4D Rendering Results

FaceLift can be combined with 2D face animation methods like LivePortrait to achieve 3D face animation.

Input Image

2D Animation (by LivePortrait)

3D Animation


BibTeX

@misc{lyu2024facelift,
      title={FaceLift: Single Image to 3D Head with View Generation and GS-LRM},
      author={Weijie Lyu and Yi Zhou and Ming-Hsuan Yang and Zhixin Shu},
      year={2024},
      eprint={2412.17812},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.17812},
}