Scaling Transformer-Based Novel View Synthesis with Token Disentanglement and Synthetic Data

ICCV 2025

*Denotes Equal Contribution

¹Johns Hopkins University     ²Google
Teaser Figure

Overview. Our method performs feed-forward novel-view synthesis from a set of input images, such as the pairs shown above. It achieves strong quality and generalization, performing well across a variety of common novel-view synthesis datasets, including scenes that are out-of-distribution.

Abstract

Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.

PCA visualization of source and target features


A visualization of the principal components of the transformer layer outputs for the source and target views of LVSM. The 24 images in each subfigure show the output of each transformer layer. LVSM features for source and target images look similar even though the source is conditioned on both the image and Plücker coordinates, while the target is conditioned on Plücker coordinates alone. This leads to inefficient use of the transformer, requiring explicit alignment of source and target features across different layers.
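For readers who want to reproduce this kind of visualization, the sketch below shows one way to project per-layer token features onto their top three principal components and render them as RGB images. It assumes the per-layer features have already been extracted as arrays of shape [H·W, C]; the function names and the 16×16 token grid are illustrative, not taken from the LVSM codebase.

```python
# Sketch: PCA-based visualization of per-layer transformer token features.
# Assumes `layer_feats` is a list of arrays, one per transformer layer,
# each of shape [H*W, C] (tokens for one image). Names are illustrative.
import numpy as np

def pca_rgb(feats: np.ndarray, hw: tuple) -> np.ndarray:
    """Project token features onto their top-3 principal components
    and rescale to [0, 1] so they can be shown as an RGB image."""
    x = feats - feats.mean(axis=0, keepdims=True)            # center per channel
    # SVD of the centered token matrix gives the principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                                       # [H*W, 3]
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)                     # normalize per component
    return proj.reshape(hw[0], hw[1], 3)

# Example: turn the outputs of all 24 layers into one image per layer,
# assuming a 16x16 token grid.
# images = [pca_rgb(f, (16, 16)) for f in layer_feats]
```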

Method Overview


An illustration of the architecture. We use CAT3D, a multi-view diffusion model, to generate synthetic views conditioned on a random input image and a random spline camera trajectory. Two randomly chosen generated views serve as the source views, and the input conditioning view serves as the target for our large reconstruction network. The reconstruction model is built from a specialized transformer block that we name the Tok-D Transformer. When real data is available, we use only the reconstruction transformer.
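The sketch below illustrates how such a synthetic training example could be assembled. It is a schematic under assumptions: `sample_views` stands in for a CAT3D-style multi-view diffusion sampler and `sample_spline_cameras` for the random spline trajectory sampler; neither name comes from the released method.

```python
# Sketch of synthetic training-pair construction (names are illustrative,
# not the paper's actual API). A CAT3D-style multi-view diffusion model is
# assumed to expose `sample_views(cond_image, cameras) -> list of images`.
import random

def make_synthetic_example(cond_image, cond_camera, diffusion_model,
                           sample_spline_cameras, num_views=8):
    """Build one (source views, target view) training example.

    The diffusion model hallucinates views along a random spline camera
    trajectory around the conditioning image; two generated views become
    the sources, and the conditioning view itself is the target.
    """
    cameras = sample_spline_cameras(anchor=cond_camera, n=num_views)  # random spline trajectory
    generated = diffusion_model.sample_views(cond_image, cameras)

    # Pick two random generated views as the source pair.
    i, j = random.sample(range(num_views), 2)
    sources = [(generated[i], cameras[i]), (generated[j], cameras[j])]

    # The conditioning view is the supervision target; since it is a real
    # image, the reconstruction loss is computed against an artifact-free view.
    target = (cond_image, cond_camera)
    return sources, target
```

Because the target is the real conditioning image, supervision never comes from a generated frame, which is consistent with the observation above that generation artifacts are the main bottleneck when scaling with synthetic data.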

Token-Disentangled (Tok-D) Transformer Block

Diagram of the Tok-D transformer block architecture

An illustration of the Tok-D transformer block, which differentiates between source and target tokens. The Tok-D transformer modulates the input to every block, while an enhanced version also modulates the attention and MLP layers.
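A minimal PyTorch sketch of the idea follows: source and target tokens receive separate learned scale/shift modulation before the attention and MLP sub-layers, in the spirit of DiT-style adaptive layer norm. The layer sizes, the embedding-based modulation, and the class name are assumptions for illustration, not the released architecture.

```python
# Minimal PyTorch sketch of a token-disentangled block: source and target
# tokens get their own scale/shift modulation before attention and the MLP.
# The exact modulation form and sizes are assumptions, not the released code.
import torch
import torch.nn as nn

class TokDBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One (shift, scale) pair per sub-layer, learned separately for
        # source tokens (index 0) and target tokens (index 1).
        self.mod = nn.Embedding(2, 4 * dim)
        nn.init.zeros_(self.mod.weight)

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # x: [B, N, dim]; token_type: [B, N] with 0 = source, 1 = target.
        shift1, scale1, shift2, scale2 = self.mod(token_type).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1           # token-type-aware modulation
        x = x + self.attn(h, h, h, need_weights=False)[0]   # full attention over all tokens
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + self.mlp(h)
```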

Quantitative Results

Quantitative comparisons for in-distribution scene synthesis at 256×256 resolution. Our method outperforms previous state-of-the-art methods across all three datasets.

| Method | Venue | RealEstate10K (PSNR↑ / SSIM↑ / LPIPS↓) | ACID (PSNR↑ / SSIM↑ / LPIPS↓) | DL3DV (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|---|
| GPNR | CVPR 23 | 24.11 / 0.793 / 0.255 | 25.28 / 0.764 / 0.332 | - |
| PixelSplat | CVPR 24 | 25.89 / 0.858 / 0.142 | 28.14 / 0.839 / 0.533 | - |
| MVSplat | ECCV 24 | 26.39 / 0.869 / 0.128 | 28.25 / 0.843 / 0.144 | 17.54 / 0.529 / 0.402 |
| DepthSplat | CVPR 25 | 27.44 / 0.887 / 0.119 | - | 19.05 / 0.610 / 0.313 |
| LVSM | ICLR 25 | 28.89 / 0.894 / 0.108 | 29.19 / 0.836 / 0.095 | 19.91 / 0.600 / 0.273 |
| Ours | ICCV 25 | 30.02 / 0.919 / 0.058 | 29.47 / 0.846 / 0.086 | 21.55 / 0.643 / 0.208 |
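For reference, the sketch below shows how the three metrics in the table above are commonly computed for a single predicted/ground-truth pair, using scikit-image for PSNR and SSIM and the `lpips` package for LPIPS. The exact evaluation protocol (crops, color space, LPIPS backbone) varies by benchmark and is not specified here, so treat this as an approximation rather than the paper's evaluation code.

```python
# Sketch of how PSNR / SSIM / LPIPS are commonly computed for one image pair.
# Inputs are float arrays of shape [H, W, 3] with values in [0, 1].
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual metric; expects inputs in [-1, 1]

def eval_pair(pred, gt):
    """Return (PSNR, SSIM, LPIPS) for a predicted and ground-truth image."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```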

BibTeX

@inproceedings{nair2025scaling,
  title={Scaling Transformer-Based Novel View Synthesis with Token Disentanglement and Synthetic Data},
  author={Nair, Nithin Gopalakrishnan and Kaza, Srinivas and Luo, Xuan and Patel, Vishal M. and Lombardi, Stephen and Park, Jungyeon},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2025}
}