DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

1HKUST(GZ) 2HKUST 3FDU 4SZU 5Knowin
*Equal contribution
Teaser Image

Comparison with state-of-the-art methods: Using a single input image and the same target camera trajectory, our method closely follows the camera motion while delivering the best perceptual quality.

Open World Camera-Controlled Video Generation

Abstract

This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the SIGMA (SemantIc Guided Mutual Alignment) mechanism, which performs RGB–depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation with over 40% reduction in camera motion errors compared with prior methods.

Methods

Overall Framework

DualCamCtrl Framework
Overall architecture of DualCamCtrl. The dual-branch framework simultaneously generates RGB and depth video latents from an input image and its corresponding depth map. The two latents are element-wise added to the encoded Plücker embedding and concatenated with the noise. The two modalities then interact through our proposed SIGMA mechanism and fusion block. During training, both predictions are supervised by their respective loss functions.
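The conditioning step described in the caption can be summarized with a short sketch. This is a minimal illustration only: the tensor names and channel sizes below are hypothetical placeholders, not the released implementation.

import torch

B, C, T, H, W = 1, 16, 13, 30, 45           # hypothetical latent dimensions
rgb_latent    = torch.randn(B, C, T, H, W)  # encoded RGB video latent
depth_latent  = torch.randn(B, C, T, H, W)  # encoded depth video latent
plucker_embed = torch.randn(B, C, T, H, W)  # encoded Plücker camera embedding
noise         = torch.randn(B, C, T, H, W)  # diffusion noise at the current step

# Each modality latent is element-wise added to the encoded Plücker embedding,
# then concatenated with the noise along the channel dimension.
rgb_input   = torch.cat([rgb_latent + plucker_embed, noise], dim=1)
depth_input = torch.cat([depth_latent + plucker_embed, noise], dim=1)
print(rgb_input.shape, depth_input.shape)   # each (B, 2C, T, H, W)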

SIGMA Fusion

Misalignment and SIGMA Insight
(a) Illustration of modality misalignment. Independent evolution of the RGB and depth latents leads to misalignment across frames, motivating the design of the SIGMA strategy for cross-modal alignment.
(b) Comparison with one-way alignment. One-way alignment transfers information unidirectionally, leading to misalignment in local semantics.
(c) Comparison with geometry-guided alignment. Under the geometry-guided setting, geometry cues evolve too quickly and become inconsistent with the RGB motion.
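For intuition, the sketch below shows one hypothetical way to realize two-way RGB–depth alignment in the spirit of SIGMA, assuming bidirectional cross-attention with a shared semantic feature modulating the queries; the module name, token shapes, and exact guidance mechanism are assumptions, and the actual SIGMA design may differ.

import torch.nn as nn

class MutualAlignmentSketch(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sem_proj = nn.Linear(dim, dim)   # projects the shared semantic guidance

    def forward(self, rgb_tok, depth_tok, sem_tok):
        # rgb_tok, depth_tok, sem_tok: (B, N, dim) token sequences
        guide = self.sem_proj(sem_tok)
        # Depth -> RGB: semantics-modulated RGB queries read from depth tokens.
        h_d2rgb, _ = self.rgb_from_depth(rgb_tok + guide, depth_tok, depth_tok)
        # RGB -> Depth: semantics-modulated depth queries read from RGB tokens.
        h_rgb2d, _ = self.depth_from_rgb(depth_tok + guide, rgb_tok, rgb_tok)
        return h_d2rgb, h_rgb2d   # cross features fed back into each branch

Because both directions are computed symmetrically, neither modality dominates the other, which is the point of mutual (rather than one-way or geometry-guided) alignment.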

Two-stage Training Pipeline

Our training objective is twofold: to let each modality develop its own generative competence and to foster effective cross-modal interaction. To achieve this, we adopt a carefully staged schedule that balances the learning dynamics of both modalities: (1) a decoupled stage, where the RGB and depth branches are trained independently to capture appearance and geometry cues separately; and (2) a fusion stage, where cross-branch interactions are enabled through a fusion block to exploit their complementary strengths.

In the decoupled stage, both branches are initialized from pretrained weights. Depth supervision is provided by a state-of-the-art monocular depth estimator, and no cross-modal fusion is applied, ensuring that each branch learns its respective cue without interference. Even early in this stage, the depth branch captures rough geometry, resembling a hazy image, and already provides useful conditioning for RGB generation.

In the fusion stage, RGB and depth branches interact via the SIGMA-enabled fusion block. The RGB branch contributes rich appearance cues, while the depth branch provides geometric structure. The fusion block is zero-initialized to allow gradual influence, and the training objective remains the combination of RGB and depth losses.
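A hedged sketch of the zero-initialized fusion block is given below, assuming the common zero-initialized residual projection pattern so that cross-branch injection starts as an identity map and its influence grows over training; the class and tensor names are illustrative, not the released code.

import torch.nn as nn

def zero_linear(dim: int) -> nn.Linear:
    # Zero-initialized projection: contributes nothing at the start of training.
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class FusionBlockSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_rgb = zero_linear(dim)     # injects depth -> RGB cross features
        self.to_depth = zero_linear(dim)   # injects RGB -> depth cross features

    def forward(self, rgb_feat, depth_feat, h_d2rgb, h_rgb2d):
        # Residual injection: with the zero init this is an identity map at the
        # beginning of the fusion stage, so pretrained behavior is preserved.
        return rgb_feat + self.to_rgb(h_d2rgb), depth_feat + self.to_depth(h_rgb2d)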

Let \(\gamma \in \{0, 1\}\) indicate whether cross-branch fusion is enabled, where \(\gamma = 0\) corresponds to the decoupled stage and \(\gamma = 1\) corresponds to the fusion stage. We define the 3D-aware cross features \(h_t^{\mathrm{RGB \rightarrow D}}\) (from RGB to Depth) and \(h_t^{\mathrm{D \rightarrow RGB}}\) (from Depth to RGB), with the corresponding losses for each branch as follows:

\[ \mathcal{L}_{\mathrm{RGB}} = \mathbb{E}\Big[ \big\| v_t^{\mathrm{RGB}} - \theta_{\mathrm{RGB}}( z_t^{\mathrm{RGB}}, t, c, \mathbf{R}, \mathbf{t}; \gamma\, h_t^{\mathrm{D \rightarrow RGB}} ) \big\|^2 \Big] \]

\[ \mathcal{L}_{\mathrm{D}} = \mathbb{E}\left[ \left\| v_t^{\mathrm{D}} - \theta_{\mathrm{D}}\left( z_t^{\mathrm{D}}, t, c, \mathbf{R}, \mathbf{t}; \gamma\, h_t^{\mathrm{RGB \rightarrow D}} \right) \right\|^2 \right] \]

\[ \mathcal{L}_{\mathrm{Overall}} = \mathcal{L}_{\mathrm{RGB}} + \lambda \mathcal{L}_{\mathrm{D}} \]
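The objective above maps directly to code. The sketch below assumes placeholder callables for the two branches (theta_rgb, theta_d), v-prediction targets, and argument names mirroring the formulas; it is an illustration of the loss computation, not the released training loop.

import torch.nn.functional as F

def dualcam_loss(theta_rgb, theta_d, z_rgb, z_d, v_rgb, v_d,
                 t, c, R, trans, h_d2rgb, h_rgb2d,
                 gamma: float, lam: float = 1.0):
    # gamma = 0: decoupled stage (cross features suppressed); gamma = 1: fusion stage.
    pred_rgb = theta_rgb(z_rgb, t, c, R, trans, gamma * h_d2rgb)
    pred_d   = theta_d(z_d, t, c, R, trans, gamma * h_rgb2d)
    loss_rgb = F.mse_loss(pred_rgb, v_rgb)   # L_RGB
    loss_d   = F.mse_loss(pred_d, v_d)       # L_D
    return loss_rgb + lam * loss_d           # L_Overall = L_RGB + lambda * L_D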

Experiments

We evaluate our method in both the image-to-video (I2V) and text-to-video (T2V) settings, comparing against state-of-the-art baselines.

Qualitative Comparison
Comparison between our method and other state-of-the-art approaches. Given the same camera pose and input image as generation conditions, our method achieves the best alignment between camera motion and scene dynamics, producing the most visually accurate video. The '+' signs marked in the figure serve as anchors for visual comparison.
I2V Quantitative Comparison
Quantitative comparisons on the I2V setting. ↑ / ↓ denotes that higher/lower is better. The best and second-best results are highlighted.
T2V Quantitative Comparison
Quantitative comparisons on the T2V setting across RealEstate10K and DL3DV.

Camera-Controlled Image to Video Generation

Camera-Controlled Text to Video Generation

Related Links

Our work builds upon several notable open-source projects, including:

DiffSynth – a diffusion-based video generation framework supporting both training and inference.

CameraCtrl and GenFusion – essential data processing pipelines and workflows.

BibTeX

If you find our work useful in your research, please consider citing us:
@article{zhang2025dualcamctrl,
  title={DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation},
  author={Zhang, Hongfei and Chen, Kanghao and Zhang, Zixin and Chen, Harold Haodong and Lyu, Yuanhuiyi and Zhang, Yuqi and Yang, Shuai and Zhou, Kun and Chen, Yingcong},
  journal={arXiv preprint arXiv:2511.23127},
  year={2025}
}