*Denotes Equal Contribution
Overview. Our method performs feed-forward novel-view synthesis from a set of input images, such as the pairs shown above. We demonstrate strong results in both quality and generalization, performing well across a variety of common novel-view synthesis datasets, including out-of-distribution scenes.
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.
A visualization of the principal components of the transformer layer outputs for the source and target images in LVSM. The 24 images in each subfigure show the output of each transformer layer. LVSM features for source and target images look similar even though the source is conditioned on both the image and Plücker coordinates, while the target is conditioned on Plücker coordinates alone. This leads to inefficient use of the transformer, which must explicitly align source and target features across layers.
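For reference, per-layer principal-component maps of this kind can be produced with a short script along the following lines. This is a minimal sketch: the `layer_outputs` list, its shapes, and the `h, w` patch-token grid are assumptions for illustration, not part of LVSM's code.

```python
import torch

def pca_rgb(features: torch.Tensor) -> torch.Tensor:
    """Project per-token features onto their top-3 principal components
    and rescale to [0, 1] so they can be viewed as an RGB map.

    features: (num_tokens, dim) output of one transformer layer for one image.
    """
    centered = features - features.mean(dim=0, keepdim=True)
    # Top-3 principal directions via low-rank PCA.
    _, _, v = torch.pca_lowrank(centered, q=3)
    proj = centered @ v[:, :3]                                   # (num_tokens, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj

# Hypothetical usage: `layer_outputs` is a list of (num_tokens, dim) tensors,
# one per transformer layer, collected with forward hooks; `h, w` is the
# patch-token grid so each projection can be reshaped into an image.
# rgb_maps = [pca_rgb(f).reshape(h, w, 3) for f in layer_outputs]
```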
An illustration of the architecture. We use CAT3D, a multi-view diffusion model, to generate synthetic views conditioned on random spline camera trajectories and a random image. Two randomly selected generated views serve as the source views, and the input conditioning view serves as the target for our large reconstruction network. Our large reconstruction model uses a special transformer block which we name the Tok-D Transformer. When real data is available, we use the reconstruction transformer directly.
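Below is a rough sketch of this synthetic-data pairing. The spline parameters and the `cat3d_generate` callable are placeholders; CAT3D does not expose a public API of this form, and camera orientations and intrinsics are omitted for brevity.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def random_spline_trajectory(num_views: int = 8, num_knots: int = 4, scale: float = 0.5):
    """Sample camera centers along a smooth random spline around the conditioning view."""
    t_knots = np.linspace(0.0, 1.0, num_knots)
    knots = np.random.uniform(-scale, scale, size=(num_knots, 3))   # random 3D control points
    spline = CubicSpline(t_knots, knots, axis=0)
    t = np.linspace(0.0, 1.0, num_views)
    return spline(t)                                                # (num_views, 3) camera centers

def make_synthetic_pair(cond_image, cond_pose, cat3d_generate, num_sources: int = 2):
    """Generated views become the source set; the conditioning image is the target."""
    cams = random_spline_trajectory()
    generated = cat3d_generate(cond_image, cond_pose, cams)         # placeholder generator call
    picks = np.random.choice(len(generated), size=num_sources, replace=False)
    sources = [generated[i] for i in picks]                         # source views for the reconstructor
    target = (cond_image, cond_pose)                                # supervision target
    return sources, target
```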
An illustration of the Tok-D transformer block -- which differentiates between source and target tokens. The Tok-D transformer modulates the input to all blocks, while an enhanced version also modulates the attention and MLP layers.
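For concreteness, here is a minimal sketch of what token-dependent modulation of this kind could look like in PyTorch. The layer layout, the shift/scale parameterization, and the embedding-based source/target flag are assumptions based on the description above, not the released Tok-D implementation.

```python
import torch
import torch.nn as nn

class TokDBlock(nn.Module):
    """Transformer block whose normalized inputs are modulated differently for
    source tokens (image + Pluecker) and target tokens (Pluecker only)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One (shift, scale) pair per token type: index 0 = source, 1 = target.
        self.mod = nn.Embedding(2, 2 * dim)
        nn.init.zeros_(self.mod.weight)   # start as identity modulation

    def forward(self, x: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim); is_target: (B, N) long tensor, 0 for source tokens, 1 for target tokens.
        shift, scale = self.mod(is_target).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift        # token-type-dependent modulation
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale) + shift
        return x + self.mlp(h)
```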
Quantitative comparisons for in-distribution scene synthesis at 256p resolution. Our method outperforms the previous SOTA method across all evaluated datasets.
| Method | Venue | RealEstate10K PSNR↑ | RealEstate10K SSIM↑ | RealEstate10K LPIPS↓ | ACID PSNR↑ | ACID SSIM↑ | ACID LPIPS↓ | DL3DV PSNR↑ | DL3DV SSIM↑ | DL3DV LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPNR | CVPR 23 | 24.11 | 0.793 | 0.255 | 25.28 | 0.764 | 0.332 | - | - | - |
| PixelSplat | CVPR 24 | 25.89 | 0.858 | 0.142 | 28.14 | 0.839 | 0.533 | - | - | - |
| MVSplat | ECCV 24 | 26.39 | 0.869 | 0.128 | 28.25 | 0.843 | 0.144 | 17.54 | 0.529 | 0.402 |
| DepthSplat | CVPR 25 | 27.44 | 0.887 | 0.119 | - | - | - | 19.05 | 0.610 | 0.313 |
| LVSM | ICLR 25 | 28.89 | 0.894 | 0.108 | 29.19 | 0.836 | 0.095 | 19.91 | 0.600 | 0.273 |
| Ours | ICCV 25 | 30.02 | 0.919 | 0.058 | 29.47 | 0.846 | 0.086 | 21.55 | 0.643 | 0.208 |
@inproceedings{placeholder2025scaling,
title={Scaling Transformer-Based Novel View Synthesis with Token Disentanglement and Synthetic Data},
author={Nair, Nithin Gopalakrishnan and Kaza, Srinivas and Luo, Xuan and Patel, Vishal M. and Lombardi, Stephen and Park, Jungyeon},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}