
AvatarArtist: Open-Domain 4D Avatarization

CVPR 2025

Hongyu Liu (1,2)   Xuan Wang (2)   Ziyu Wan (3)   Yue Ma (1)   Jingye Chen (1)   Yanbo Fan (2)   Yujun Shen (2)   Yibing Song   Qifeng Chen (1)
(1) HKUST   (2) Ant Group   (3) City University of Hong Kong

TL;DR: Our model creates a 4D avatar from a single open-domain reference image.

Overview


Our approach consists of two steps at inference time. First, the DiT model generates a 4D representation from the input image. Then, our Motion-Aware Cross-Domain Renderer takes this 4D representation and, guided by both the input image and the driving signals, renders the final target image.
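
A minimal sketch of this two-step inference flow, assuming PyTorch. The names dit.generate, renderer, and avatarize are hypothetical placeholders for illustration, not the released AvatarArtist API.

import torch

@torch.no_grad()
def avatarize(reference_image, driving_frames, dit, renderer):
    """reference_image: (3, H, W) tensor; driving_frames: list of motion signals."""
    # Step 1: the DiT prior generates a 4D representation (a parametric
    # triplane) conditioned on the single reference image.
    triplane = dit.generate(reference_image)           # e.g. (C, 3, H', W')

    # Step 2: the motion-aware cross-domain renderer turns the triplane
    # into target frames, guided by the reference image and driving signals.
    frames = []
    for motion in driving_frames:
        frame = renderer(triplane, source=reference_image, motion=motion)
        frames.append(frame)
    return torch.stack(frames)                         # (T, 3, H, W)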

Open-Domain Paired Image and 4D Representation Dataset Generation

method figure

Overview of the dataset generation pipeline. We utilize text prompts to transform images from the realistic domain to the target domain, while SDEdit and a landmark-guided ControlNet ensure that the generated images preserve the pose and expression of the source images. This lets us directly reuse the mesh from the original domain, avoiding inaccurate mesh extraction in non-realistic domains. After domain transfer, we train 4D GANs for the different domains to generate paired images and parametric triplanes, which serve as training data for the subsequent generation model. The parametric triplane consists of dynamic and static components, with the dynamic region aligned to the mesh.
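
A hedged sketch of the domain-transfer step, assuming Stable Diffusion via the diffusers library. SDEdit is approximated by the img2img strength parameter (partial noising of the source), and a landmark-conditioned ControlNet pins down pose and expression. The ControlNet checkpoint path and the landmark image are placeholders; this is not the paper's released code.

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "path/to/landmark-controlnet",   # placeholder: a facial-landmark ControlNet
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def transfer_domain(source_image, landmark_image, prompt):
    # strength < 1 keeps part of the source content (the SDEdit idea), while
    # the landmark control image preserves pose and expression, so the mesh
    # extracted from the realistic source can be reused for the stylized image.
    return pipe(
        prompt=prompt,                 # e.g. "a LEGO minifigure portrait"
        image=source_image,            # realistic-domain source portrait
        control_image=landmark_image,  # rendered facial landmarks of the source
        strength=0.6,
        num_inference_steps=30,
    ).images[0]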

DiT and Motion-Aware Cross-Domain Renderer

method figure

DiT and Render Architecture. We first train a VAE to compress the parametric triplane into a latent space, followed by training a DiT to denoise the noisy latent. We integrate features from DINO and CLIP into the DiT to guide the generation process. For the motion-aware cross-domain renderer, we utilize an encoder to extract features from the source image. These extracted features are then passed to a ViT, which predicts results under the guidance of the generated parametric triplane and motion embedding. Finally, a decoder processes the ViT output and fuses it with the rasterization results to generate the final output.
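
A minimal sketch of one DiT block with the image-feature conditioning described above: DINO and CLIP tokens of the input image are injected via cross-attention into the denoising of the triplane latent. Dimensions and layer choices are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        # x:    (B, N, dim)  noisy triplane-latent tokens from the VAE space
        # cond: (B, M, dim)  concatenated DINO + CLIP tokens of the input image
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        h = self.n2(x)
        x = x + self.cross_attn(h, cond, cond)[0]   # image-feature guidance
        return x + self.mlp(self.n3(x))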

Generation Gallery

We showcase a variety of stylistic outputs generated by AvatarArtist, demonstrating its capability to produce diverse styles across different domains.

We also present avatars generated by AvatarArtist from input images across diverse domains, such as LEGO figures, zombies, and more.

The reference input images are displayed in the first row, while the corresponding 4D avatars are shown in the second row.

Baseline Comparisons

We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, please refer to our paper.

Animating Edited Portrait Images

We can first edit the input images with off-the-shelf image editing models and then use AvatarArtist to generate 4D avatars from the edited inputs, as sketched below.
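
A sketch of the edit-then-avatarize workflow, assuming InstructPix2Pix from the diffusers library as the off-the-shelf editor. avatarize, driving_frames, dit, and renderer refer to the hypothetical inference sketch in the Overview section, not a published API.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

editor = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

portrait = Image.open("reference.png").convert("RGB")
edited = editor(
    "turn the person into a marble statue",  # free-form edit instruction
    image=portrait,
    num_inference_steps=20,
).images[0]

# Feed the edited portrait to the avatarization pipeline sketched in the
# Overview section (hypothetical entry point, not a released function).
avatar_frames = avatarize(edited, driving_frames, dit, renderer)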

method figure

Reference images (top) and the corresponding AvatarArtist avatars (bottom).

BibTeX

@inproceedings{liu2025avatarartist,
    author    = {Liu, Hongyu and Wang, Xuan and Wan, Ziyu and Ma, Yue and Chen, Jingye and Fan, Yanbo and Shen, Yujun and Song, Yibing and Chen, Qifeng},
    title     = {AvatarArtist: Open-Domain 4D Avatarization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}
    