SViM3D

Stable Video Material Diffusion for Single Image 3D Generation

University of Tübingen · Stability AI
(Teaser figure)

SViM3D generates multi-view consistent physically based rendering (PBR) materials from a single image.

We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view, based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as a neural prior. We introduce various mechanisms into this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets.

Overview

Stable Video Materials 3D (SViM3D) is a probabilistic generative diffusion model that tackles object-centric inverse rendering from a single image. Conditioned on a camera pose sequence, it generates both high-quality appearance and the corresponding multi-view consistent material properties. Unlike prior approaches that decouple material estimation from 3D reconstruction, SViM3D is the first camera-controllable multi-view model that produces fully spatially varying PBR parameters, RGB color and surface normals simultaneously. The additional output can be leveraged in various applications; hence we consider SViM3D a foundational model that provides a unified neural prior for both 3D reconstruction and material understanding. SViM3D's output can be used to relight the views directly, perform material edits, or generate full 3D assets by lifting the multi-view material parameters to 3D. As 3D training data paired with material parameters is scarce, we leverage the accumulated world knowledge of a latent video diffusion model. Specifically, we adapt SV3D [1], a video diffusion model fine-tuned for camera control, by incorporating several crucial modifications:

  • Multi-illumination multi-view training dataset: We render a high-quality photorealistic synthetic dataset, capturing the complexity of real-world lighting and material variations.
  • Material latent representation: We treat the material parameters and surface normals as images, reusing the image-based autoencoder to encode all inputs into unified latents (see the sketch after this list).
  • Adapted UNet architecture: We make crucial changes in the core architecture and training scheme to smoothly adapt from image to image+material+normal generation.
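
The material latent design can be illustrated with a short sketch. This is a minimal example, assuming a diffusers-style AutoencoderKL as the shared image VAE; the checkpoint path, the encode_maps helper, and the packing of roughness and metallic into one three-channel map are illustrative assumptions, not the paper's exact setup.

    import torch
    from diffusers import AutoencoderKL  # any pre-trained image VAE with 4 latent channels

    # Placeholder path -- illustrative, not the actual checkpoint used in the paper.
    vae = AutoencoderKL.from_pretrained("path/to/pretrained-image-vae")

    @torch.no_grad()
    def encode_maps(rgb, basecolor, rough_metal, normal):
        """Encode each per-frame map (B, 3, H, W), scaled to [-1, 1], with the
        shared image VAE and concatenate the results into one unified latent."""
        latents = [
            vae.encode(x).latent_dist.mode() * vae.config.scaling_factor
            for x in (rgb, basecolor, rough_metal, normal)
        ]
        return torch.cat(latents, dim=1)  # (B, 4 * latent_channels, h, w)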

Method

The figure below outlines our pipeline, which can be divided into a material video generation part and a reconstruction part.

(Pipeline figure)

Material video generation: We reuse the pre-trained VAE to define the latent encoding for the additional material channels. The denoising UNet is fine-tuned to process the new set of inputs and outputs while keeping the conditioning from the video model (SV3D) intact. For efficient training we employ a staged adaptation using our new material dataset and several heuristics that filter the available object data with respect to material quality.
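
One common way to let a pre-trained denoising UNet accept and produce the additional latent channels is to widen its first and last convolutions and zero-initialize the new weights, so that fine-tuning starts from the original RGB behavior. The PyTorch sketch below illustrates this idea; the layer names and channel counts are illustrative assumptions, not the paper's exact architecture changes.

    import torch
    import torch.nn as nn

    def widen_conv(conv: nn.Conv2d, extra_in: int = 0, extra_out: int = 0) -> nn.Conv2d:
        """Copy of `conv` with additional input/output channels, zero-initialized
        so the widened layer initially reproduces the pre-trained behavior."""
        new = nn.Conv2d(
            conv.in_channels + extra_in,
            conv.out_channels + extra_out,
            kernel_size=conv.kernel_size,
            stride=conv.stride,
            padding=conv.padding,
            bias=conv.bias is not None,
        )
        with torch.no_grad():
            new.weight.zero_()
            new.weight[: conv.out_channels, : conv.in_channels] = conv.weight
            if conv.bias is not None:
                new.bias.zero_()
                new.bias[: conv.out_channels] = conv.bias
        return new

    # Illustrative usage: extend a UNet denoising 4 RGB latent channels so it also
    # handles the material and normal latents (e.g. 8 extra channels in and out).
    # unet.conv_in  = widen_conv(unet.conv_in,  extra_in=8)
    # unet.conv_out = widen_conv(unet.conv_out, extra_out=8)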

Reconstruction: The output videos can then be used as pseudo ground truth in a 3D optimization framework, which builds on our efficient illumination representation together with novel mechanisms that improve robustness against inconsistent generations. The reconstruction includes the following steps:

  • An illumination representation is pre-optimized using the orbital video M to initialize Phase 1.
  • A modified Instant-NGP is optimized using a photometric rendering loss that relies on the jointly optimized illumination and on supervision from the reference views (a simplified sketch of this step follows the list).
  • We optimize a DMTet representation initialized from the results of Phase 1 via marching cubes.
  • A mesh is finally extracted, UV-unwrapped using xatlas, and all textures are baked.
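
The heart of Phase 1 is the photometric rendering loss between views rendered under the jointly optimized illumination and the generated pseudo ground truth. Below is a minimal PyTorch sketch of one optimization step; field, illumination, and render are hypothetical stand-ins for the actual Instant-NGP field, lighting representation, and differentiable renderer.

    import torch
    import torch.nn.functional as F

    def phase1_step(field, illumination, render, optimizer, batch):
        """One simplified Phase-1 optimization step.

        batch["rgb"]: generated pseudo ground-truth frames, (B, 3, H, W)
        batch["camera"]: the corresponding camera poses used for conditioning
        """
        optimizer.zero_grad()
        # Render the neural field under the jointly optimized illumination.
        pred = render(field, illumination, batch["camera"])  # (B, 3, H, W)
        loss = F.mse_loss(pred, batch["rgb"])                # photometric term
        loss.backward()
        optimizer.step()
        return loss.item()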

Results

The parametric material model allows for direct relighting and editing in image space. Try it yourself using the demo below.

Applications: We show an example from our Poly Haven [2] test dataset. Using our multi-view consistent material video, we can relight novel views based on the camera conditioning and our efficient illumination representation.
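
Relighting a single generated view only requires per-pixel shading with the predicted maps and a chosen illumination. The NumPy sketch below uses a single directional light and purely diffuse shading for clarity; the paper's efficient illumination representation and full PBR shading model go beyond this illustrative example.

    import numpy as np

    def relight_directional(basecolor, normal, light_dir, light_color=(1.0, 1.0, 1.0)):
        """Relight one view in image space with a single directional light.

        basecolor: (H, W, 3) values in [0, 1]; normal: (H, W, 3) unit normals;
        light_dir: (3,) direction pointing towards the light source.
        """
        l = np.asarray(light_dir, dtype=np.float32)
        l = l / np.linalg.norm(l)
        n_dot_l = np.clip(np.einsum("hwc,c->hw", normal, l), 0.0, 1.0)
        shading = n_dot_l[..., None] * np.asarray(light_color, dtype=np.float32)
        return np.clip(basecolor * shading, 0.0, 1.0)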

Comparison to Prior Works: Note the improved consistency and level of detail of our conditioned multi-view material generation compared to prior works.

Examples taken from the Poly Haven [2] test dataset.

(Comparison figure)

Demo: Upload your own image and environment map, or use one of the examples, to try the PBR material generation and 2D relighting pipeline here.

Citation

@inproceedings{engelhardt2025-svim3d,
  author    = {Engelhardt, Andreas and Boss, Mark and Voleti, Vikram and Yao, Chun-Han and Lensch, Hendrik P.A. and Jampani, Varun},
  title     = {{SViM3D}: Stable Video Material Diffusion for Single Image 3D Generation},
  booktitle = {ICCV},
  year      = {2025}
}

Acknowledgements

This work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645 and SFB 1233, TP 02 - Project number 276693517. It was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A.

References

[1] Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. ECCV, 2024.

[2] Poly Haven. https://polyhaven.com/, 2024. Accessed 22-08-2024.