This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl targets this limitation with a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize the two modalities, we further propose the SIGMA (SemantIc Guided Mutual Alignment) mechanism, which performs RGB–depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. We further analyze the distinct influences of depth and camera poses across denoising stages, showing that the early and late stages play complementary roles: forming global structure and refining local details, respectively. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, reducing camera motion errors by over 40% compared with prior methods.
Our training objective is twofold: enable each modality to develop generative competence, and foster effective cross-modal interaction. To achieve this, we adopt a carefully staged schedule that balances the learning dynamics of both modalities (a minimal sketch of the stage gating follows this list):

1. Decoupled stage: the RGB and depth branches are trained independently to capture appearance and geometry cues separately.
2. Fusion stage: cross-branch interactions are enabled through a fusion block to exploit the branches' complementary strengths.
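The stage switch can be expressed as a single gate on the cross-branch features. Below is a minimal sketch, assuming a hypothetical step threshold `DECOUPLED_STEPS` and helper name `fusion_gate`; neither is from the paper.

```python
# Hypothetical two-stage schedule: gamma = 0 (decoupled) -> gamma = 1 (fusion).
DECOUPLED_STEPS = 50_000  # illustrative length of the decoupled stage

def fusion_gate(global_step: int) -> int:
    """Return gamma for the current step: 0 disables cross-branch fusion,
    1 enables it (see the loss definitions below)."""
    return 0 if global_step < DECOUPLED_STEPS else 1
```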
In the decoupled stage, both branches are initialized from pretrained weights. Depth supervision is obtained from a state-of-the-art monocular depth estimator, and no cross-modal fusion is applied, ensuring that each branch learns its respective cue without interference. Even early in this stage, the depth branch captures rough geometry, resembling a hazy image, and can already provide useful conditioning for RGB generation. A sketch of the depth-supervision preprocessing follows.
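As an illustration of how per-frame depth targets could be produced, the sketch below uses the Hugging Face `depth-estimation` pipeline; the paper does not name a specific estimator, so the model choice and the helper `frames_to_depth` are assumptions.

```python
# Sketch: generate depth supervision for each video frame with an
# off-the-shelf monocular depth estimator.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation")  # default DPT model; any SOTA estimator works

def frames_to_depth(frames: list[Image.Image]) -> list[Image.Image]:
    """Run the estimator frame by frame; each result dict holds the
    depth map under the "depth" key as a PIL image."""
    return [depth_estimator(frame)["depth"] for frame in frames]
```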
In the fusion stage, the RGB and depth branches interact via the SIGMA-enabled fusion block: the RGB branch contributes rich appearance cues, while the depth branch provides geometric structure. The fusion block is zero-initialized so that its influence grows gradually, and the training objective remains the combination of the RGB and depth losses (a sketch of the zero-init idea is shown below).
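The following is a minimal sketch of a zero-initialized cross-attention fusion block; it illustrates the gradual-influence idea only and is not the paper's SIGMA implementation (the class and argument names are ours).

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Cross-branch fusion with a zero-initialized output projection:
    the residual path returns the input unchanged at step 0, so the
    cross-modal signal is introduced gradually during training."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)  # zero init: no influence initially
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x: torch.Tensor, cross: torch.Tensor) -> torch.Tensor:
        # x: tokens of one branch; cross: tokens from the other branch
        attn_out, _ = self.cross_attn(query=x, key=cross, value=cross)
        return x + self.out_proj(attn_out)  # identity mapping at initialization
```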
Let \(\gamma \in \{0, 1\}\) indicate whether cross-branch fusion is enabled, where \(\gamma = 0\) corresponds to the decoupled stage and \(\gamma = 1\) corresponds to the fusion stage. We define the 3D-aware cross features \(h_t^{\mathrm{RGB \rightarrow D}}\) (from RGB to Depth) and \(h_t^{\mathrm{D \rightarrow RGB}}\) (from Depth to RGB), with the corresponding losses for each branch as follows:
\[ \mathcal{L}_{\mathrm{RGB}} = \mathbb{E}\Big[ \big\| v_t^{\mathrm{RGB}} - \theta_{\mathrm{RGB}}\big( z_t^{\mathrm{RGB}}, t, c, \mathbf{R}, \mathbf{t};\, \gamma\, h_t^{\mathrm{D \rightarrow RGB}} \big) \big\|^2 \Big] \]
\[ \mathcal{L}_{\mathrm{D}} = \mathbb{E}\Big[ \big\| v_t^{\mathrm{D}} - \theta_{\mathrm{D}}\big( z_t^{\mathrm{D}}, t, c, \mathbf{R}, \mathbf{t};\, \gamma\, h_t^{\mathrm{RGB \rightarrow D}} \big) \big\|^2 \Big] \]
\[ \mathcal{L}_{\mathrm{Overall}} = \mathcal{L}_{\mathrm{RGB}} + \lambda \mathcal{L}_{\mathrm{D}} \]
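For concreteness, here is a minimal PyTorch sketch of the combined objective, assuming v-prediction outputs from each branch; the gamma-gated cross features are applied inside the networks as in the equations above, and `lam` stands for the weight \(\lambda\).

```python
import torch
import torch.nn.functional as F

def overall_loss(v_rgb: torch.Tensor, v_d: torch.Tensor,
                 pred_rgb: torch.Tensor, pred_d: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """L_Overall = L_RGB + lambda * L_D, where each branch loss is the MSE
    between the v-prediction target and the branch output."""
    loss_rgb = F.mse_loss(pred_rgb, v_rgb)  # L_RGB
    loss_d = F.mse_loss(pred_d, v_d)        # L_D
    return loss_rgb + lam * loss_d
```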
We evaluate our method in both the I2V (image-to-video) and T2V (text-to-video) settings, comparing against state-of-the-art baselines.
Our work builds upon several notable open-source projects, including:
DiffSynth – a diffusion-based video generation framework supporting both training and inference.
CameraCtrl and GenFusion – essential data processing pipelines and workflows.
@article{zhang2025dualcamctrl,
title={DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation},
author={Zhang, Hongfei and Chen, Kanghao and Zhang, Zixin and Chen, Harold Haodong and Lyu, Yuanhuiyi and Zhang, Yuqi and Yang, Shuai and Zhou, Kun and Chen, Yingcong},
journal={arXiv preprint arXiv:2511.23127},
year={2025}
}