*Denotes Equal Contribution
Overview. Our method performs feed-forward novel-view synthesis from a set of input images, such as the pairs shown above. We demonstrate strong results in both quality and generalization, performing well across a variety of common novel-view synthesis datasets, including out-of-distribution scenes.
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.
A visualization of the principal components of the transformer layer outputs for the source and target images in LVSM. The 24 images in each subfigure show the output of each transformer layer. LVSM features for source and target images look similar even though the source is conditioned on both the image and Plücker coordinates, while the target is conditioned on Plücker coordinates alone. This leads to inefficient use of the transformer, which must explicitly align source and target features across layers.
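For reference, per-layer principal-component maps of this kind can be produced with a short script along the following lines. This is a minimal sketch: the `layer_outputs` list, its shapes, and the `h, w` patch-token grid are assumptions for illustration, not part of LVSM's code.

```python
import torch

def pca_rgb(features: torch.Tensor) -> torch.Tensor:
    """Project per-token features onto their top-3 principal components
    and rescale to [0, 1] so they can be viewed as an RGB map.

    features: (num_tokens, dim) output of one transformer layer for one image.
    """
    centered = features - features.mean(dim=0, keepdim=True)
    # Top-3 principal directions via low-rank PCA.
    _, _, v = torch.pca_lowrank(centered, q=3)
    proj = centered @ v[:, :3]                                   # (num_tokens, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj

# Hypothetical usage: `layer_outputs` is a list of (num_tokens, dim) tensors,
# one per transformer layer, collected with forward hooks; `h, w` is the
# patch-token grid so each projection can be reshaped into an image.
# rgb_maps = [pca_rgb(f).reshape(h, w, 3) for f in layer_outputs]
```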
An illustration of the architecture. We use CAT3D, a multi-view diffusion model, to generate synthetic views conditioned on random spline camera trajectories and a random image. Two randomly selected generated views serve as the source views, and the input conditioning view serves as the target for our large reconstruction network. Our large reconstruction model uses a special transformer block which we name the Tok-D Transformer. When real data is available, we use the reconstruction transformer directly.
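Below is a rough sketch of this synthetic-data pairing. The spline parameters and the `cat3d_generate` callable are placeholders; CAT3D does not expose a public API of this form, and camera orientations and intrinsics are omitted for brevity.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def random_spline_trajectory(num_views: int = 8, num_knots: int = 4, scale: float = 0.5):
    """Sample camera centers along a smooth random spline around the conditioning view."""
    t_knots = np.linspace(0.0, 1.0, num_knots)
    knots = np.random.uniform(-scale, scale, size=(num_knots, 3))   # random 3D control points
    spline = CubicSpline(t_knots, knots, axis=0)
    t = np.linspace(0.0, 1.0, num_views)
    return spline(t)                                                # (num_views, 3) camera centers

def make_synthetic_pair(cond_image, cond_pose, cat3d_generate, num_sources: int = 2):
    """Generated views become the source set; the conditioning image is the target."""
    cams = random_spline_trajectory()
    generated = cat3d_generate(cond_image, cond_pose, cams)         # placeholder generator call
    picks = np.random.choice(len(generated), size=num_sources, replace=False)
    sources = [generated[i] for i in picks]                         # source views for the reconstructor
    target = (cond_image, cond_pose)                                # supervision target
    return sources, target
```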
An illustration of the Tok-D transformer block -- which differentiates between source and target tokens. The Tok-D transformer modulates the input to all blocks, while an enhanced version also modulates the attention and MLP layers.
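For concreteness, here is a minimal sketch of what token-dependent modulation of this kind could look like in PyTorch. The layer layout, the shift/scale parameterization, and the embedding-based source/target flag are assumptions based on the description above, not the released Tok-D implementation.

```python
import torch
import torch.nn as nn

class TokDBlock(nn.Module):
    """Transformer block whose normalized inputs are modulated differently for
    source tokens (image + Pluecker) and target tokens (Pluecker only)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One (shift, scale) pair per token type: index 0 = source, 1 = target.
        self.mod = nn.Embedding(2, 2 * dim)
        nn.init.zeros_(self.mod.weight)   # start as identity modulation

    def forward(self, x: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim); is_target: (B, N) long tensor, 0 for source tokens, 1 for target tokens.
        shift, scale = self.mod(is_target).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift        # token-type-dependent modulation
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale) + shift
        return x + self.mlp(h)
```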
Quantitative comparisons for in-distribution scene synthesis at 256p resolution. Our method outperforms the previous SOTA method across all evaluated datasets.
| Method | Venue | RealEstate10K PSNR↑ | RealEstate10K SSIM↑ | RealEstate10K LPIPS↓ | ACID PSNR↑ | ACID SSIM↑ | ACID LPIPS↓ | DL3DV PSNR↑ | DL3DV SSIM↑ | DL3DV LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPNR | CVPR 23 | 24.11 | 0.793 | 0.255 | 25.28 | 0.764 | 0.332 | - | - | - |
| PixelSplat | CVPR 24 | 25.89 | 0.858 | 0.142 | 28.14 | 0.839 | 0.533 | - | - | - |
| MVSplat | ECCV 24 | 26.39 | 0.869 | 0.128 | 28.25 | 0.843 | 0.144 | 17.54 | 0.529 | 0.402 |
| DepthSplat | CVPR 25 | 27.44 | 0.887 | 0.119 | - | - | - | 19.05 | 0.610 | 0.313 |
| LVSM | ICLR 25 | 28.89 | 0.894 | 0.108 | 29.19 | 0.836 | 0.095 | 19.91 | 0.600 | 0.273 |
| Ours | ICCV 25 | 30.02 | 0.919 | 0.058 | 29.47 | 0.846 | 0.086 | 21.55 | 0.643 | 0.208 |
@inproceedings{placeholder2025scaling,
title={Scaling Transformer-Based Novel View Synthesis with Token Disentanglement and Synthetic Data},
author={Nair, Nithin Gopalakrishnan and Kaza, Srinivas and Luo, Xuan and Patel, Vishal M. and Lombardi, Stephen and Park, Jungyeon},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}