Temporally Consistent Object Editing in Videos using Extended Attention

1Concordia University, Montreal, Canada; 2Mila – Quebec AI Institute

IEEE/CVF CVPR Workshop on AI for Content Creation (AI4CC) 2024


*Corresponding Author

Video Presentation

Abstract

Image generation and editing have advanced rapidly with the rise of large-scale diffusion models that allow user control over different modalities such as text, masks, and depth maps. However, controlled editing of videos still lags behind. Prior work in this area has focused on using 2D diffusion models to globally change the style of an existing video. In many practical applications, however, editing localized parts of the video is critical. In this work, we propose a method to edit videos using a pre-trained inpainting image diffusion model. We systematically redesign the forward path of the model by replacing its self-attention modules with extended attention modules that create frame-level dependencies. In this way, we ensure that the edited content remains consistent across all video frames, regardless of the shape and position of the masked area. We qualitatively compare our results with the state of the art on several video editing tasks, including object retargeting, object replacement, and object removal. Experiments demonstrate the superior performance of the proposed approach.

Methodology

We systematically redesign the forward path of the pre-trained inpainting diffusion model by replacing its self-attention modules with extended attention modules that induce dependencies between frames. Note that we do not change the architecture of the existing U-Net in the SD model; we only modify the computation in the forward path of the self-attention layers, using several frames instead of one so that similar features can be extracted across frames. As a result, our approach does not require any additional training or fine-tuning. Given these shared features, we constrain the model to edit (reconstruct) the video so that regions with similar features across frames are kept unchanged, while the remaining parts are modified according to the control inputs (mask and text prompt). In this way, we ensure that the edited content remains consistent across all video frames, regardless of the shape and position of the masked area. The figure below (right) visualizes how the forward path of the extended attention works in our framework, and the figure below (left) shows the whole diffusion process of our temporally consistent video editing technique. More specifically, at each diffusion step we randomly select several frames together with their corresponding mask images. These pairs of masks and frames are fed into the pre-processor described above, and the extended attention layers in the U-Net architecture, shown in the figure below, extract shared features from the selected frames. This process is repeated for T=50 diffusion steps.
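As a rough illustration of the idea, the following PyTorch sketch shows how the keys and values of several frames can be shared inside a single attention call. The tensor shapes and the function name are illustrative assumptions made for this sketch, not the exact implementation in our framework.

import torch


def extended_attention(q, k, v):
    # q, k, v: (n_frames, n_tokens, dim), produced by the frozen
    # self-attention projections of the U-Net (weights unchanged).
    # Instead of attending within each frame independently, every frame's
    # queries attend to the keys/values of all selected frames, which is
    # what creates the frame-level dependencies described above.
    n_frames, n_tokens, dim = k.shape
    # Flatten keys/values across frames, then broadcast so each frame's
    # queries see the same shared key/value set.
    k_ext = k.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
    v_ext = v.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
    attn = torch.softmax(q @ k_ext.transpose(-1, -2) / dim ** 0.5, dim=-1)
    return attn @ v_ext

Because the projection weights are untouched, this is a drop-in change to the forward computation: the pre-trained model is preserved while every frame can attend to the features of the other selected frames.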

An overview of the diffusion process for temporally consistent video editing
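The loop below sketches this per-step frame sampling at a high level, under stated assumptions: denoise_step stands in for one pass of the inpainting U-Net with the patched extended-attention layers, and its interface is assumed for illustration rather than taken from our code.

import random


def run_extended_diffusion(latents, masks, denoise_step,
                           num_steps=50, frames_per_step=4):
    # latents / masks: per-frame latent codes and inpainting masks.
    # denoise_step(selected_latents, selected_masks, t) is an assumed
    # callable that runs one denoising step with extended attention.
    n_frames = len(latents)
    for t in reversed(range(num_steps)):
        # Randomly pick a subset of frames (and their masks) for this step,
        # so that over many steps every frame shares features with the others.
        idx = random.sample(range(n_frames), k=min(frames_per_step, n_frames))
        updated = denoise_step([latents[i] for i in idx],
                               [masks[i] for i in idx], t)
        for i, new_latent in zip(idx, updated):
            latents[i] = new_latent
    return latents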

Quantitative Comparison

To numerically evaluate our method, we follow the criteria used by several state-of-the-art approaches in the literature [1, 2]: the structural similarity (SSIM) index and the peak signal-to-noise ratio (PSNR). Both measure the similarity between two images; the SSIM index can be viewed as a quality measure of one image given that the other is of perfect quality. We use PSNR and SSIM to evaluate the visual and perceptual similarity between input and output videos. More specifically, comparing our method with ProPainter [2] and E2FGVI [1] using these metrics provides insight into how similar the corresponding regions are in the input and output (edited) video frames. This is useful for the video object removal and video object retargeting tasks (explained in Section 3 of the main paper), since in these two tasks the goal is to remove the masked-out areas and fill them with content from the surrounding regions of the input frames. It is therefore important to know how effective each method is at editing the masked-out object while keeping the other areas of the input video unchanged.

 A quantitative comparison between our proposed method, ProPainter, and E2FGVI

Note that we do not compare our method with ProPainter and E2FGVI on the video object replacement task, as those methods cannot perform it. As indicated in the table above, we achieve better results on all metrics in the video object retargeting task. The superiority of the proposed method can also be verified visually in the qualitative results shown in the video above. For the video object removal task, our method achieves roughly the same level of fidelity as the state of the art.
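For reference, a minimal sketch of how the per-frame PSNR and SSIM scores can be computed with scikit-image is given below; the helper name and the simple averaging over frames are illustrative assumptions rather than the exact evaluation script used for the table above.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def video_similarity(input_frames, output_frames):
    # Average PSNR/SSIM between corresponding input and edited frames.
    # Frames are expected as uint8 RGB arrays of identical shape.
    psnr_vals, ssim_vals = [], []
    for src, out in zip(input_frames, output_frames):
        psnr_vals.append(peak_signal_noise_ratio(src, out, data_range=255))
        ssim_vals.append(structural_similarity(src, out, channel_axis=-1,
                                               data_range=255))
    return float(np.mean(psnr_vals)), float(np.mean(ssim_vals))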

BibTeX

We appreciate your interest in our research. If you build on our work, please cite it using the BibTeX entry below.
@article{zamani2024temporally,
  title={Temporally Consistent Object Editing in Videos using Extended Attention},
  author={Zamani, AmirHossein and Aghdam, Amir G and Popa, Tiberiu and Belilovsky, Eugene},
  journal={arXiv preprint arXiv:2406.00272},
  year={2024}
}

References

[1] Li, Zhen, et al. "Towards an End-to-End Framework for Flow-Guided Video Inpainting." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Zhou, Shangchen, et al. "ProPainter: Improving Propagation and Transformer for Video Inpainting." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.