We present sample results of our method, LatentWarp, and compare it against existing text-guided video editing methods, which suffer from temporal inconsistency across frames.
| "Van Gogh Starry Night style, a bear" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | Video-P2P ([4]) | TokenFlow ([5]) | Re-render a Video ([7]) |
| "Oil painting style, a dog" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | Video-P2P ([4]) | TokenFlow ([5]) | Re-render a Video ([7]) |
| "A marble sculpture of a woman running" | Ours | Text-to-video ([1]) | TAV ([2]) |
|---|---|---|---|
| Gen1 ([3]) | Video-P2P ([4]) | TokenFlow ([5]) | Re-render a Video ([7]) |
Ablation studies on the respective effects of latent alignment, which constrains the query tokens, and cross-frame attention, which constrains the key and value tokens.
| Original video | Ours w/o cross-frame attention | Ours w/o latent alignment | Ours |
|---|---|---|---|
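For intuition on what these two components do, a minimal sketch is given below. It is an illustrative reconstruction rather than the released implementation: the exact attention layout, the choice of the previous frame as the key/value source, and the `align_latent` helper (which warps the previous frame's latent with optical flow inside a visibility mask) are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q_cur, k_ref, v_ref, num_heads=8):
    """Attention variant where the current frame supplies the queries (Q) while
    the keys (K) and values (V) come from a reference frame, so edited frames
    share appearance. All inputs: (batch, tokens, dim)."""
    b, n, d = q_cur.shape
    h = num_heads
    q = q_cur.reshape(b, n, h, d // h).transpose(1, 2)       # (b, h, n, d/h)
    k = k_ref.reshape(b, -1, h, d // h).transpose(1, 2)
    v = v_ref.reshape(b, -1, h, d // h).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)            # standard attention
    return out.transpose(1, 2).reshape(b, n, d)

def align_latent(latent_prev, latent_cur, flow, mask):
    """Warp the previous frame's latent with optical flow and blend it into the
    current latent inside the visibility mask, keeping the query tokens of
    adjacent frames consistent.
    latent_*: (b, c, H, W); flow: (b, 2, H, W) in pixels; mask: (b, 1, H, W)."""
    b, c, h, w = latent_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(latent_prev.device)  # (H, W, 2)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)       # shift by the flow
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0         # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(latent_prev, grid, align_corners=True)
    # Keep the warped latent where the flow is reliable, the original elsewhere.
    return mask * warped + (1.0 - mask) * latent_cur
```

In this reading, latent alignment keeps the query tokens of adjacent frames consistent in visible regions, while cross-frame attention ties the key and value tokens to a shared reference frame.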
Ablation results with different hyperparameters (α and the threshold) and different optical-flow extraction networks.
| α ranges from 4 to 6. | α = 4 | α = 5 (Ours) | α = 6 |
|---|---|---|---|
| Threshold ranges from 0.5 to 0.7. | threshold = 0.5 | threshold = 0.6 (Ours) | threshold = 0.7 |
| Comparison between GMFlow and RAFT. | LatentWarp with GMFlow ([8]) | LatentWarp with RAFT ([9]) | |
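To illustrate the role of the optical-flow network and the threshold, the sketch below extracts flow with RAFT from torchvision (GMFlow could be swapped in) and builds a validity mask via a forward-backward consistency check. Interpreting `threshold` as the tolerance of this check is an assumption made for illustration, and the role of α is not shown here.

```python
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Pretrained RAFT from torchvision; GMFlow could be substituted here.
raft = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()

@torch.no_grad()
def flow_and_mask(frame_a, frame_b, threshold=0.6):
    """frame_a, frame_b: (1, 3, H, W) tensors in [-1, 1], H and W divisible by 8.
    Returns the forward flow a->b and a validity mask from a forward-backward
    consistency check; treating `threshold` as the (relative) tolerance of this
    check is an assumption."""
    fwd = raft(frame_a, frame_b)[-1]               # (1, 2, H, W), finest estimate
    bwd = raft(frame_b, frame_a)[-1]
    h, w = fwd.shape[-2:]
    # Warp the backward flow into frame_a's coordinates using the forward flow.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(fwd.device)[None]
    grid = base + fwd.permute(0, 2, 3, 1)
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    bwd_warped = F.grid_sample(bwd, grid, align_corners=True)
    residual = (fwd + bwd_warped).norm(dim=1, keepdim=True)  # ~0 where consistent
    mask = (residual < threshold * residual.mean()).float()  # 1 = reliable flow
    return fwd, mask
```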
Ablations on performing latent alignment only during the first 16 denoising steps versus only during the last 4 steps.
| Original video | In the last 4 steps | In the first 16 steps |
|---|---|---|
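The schedule being ablated can be written as a small sampling loop, sketched below under the assumption of a 20-step sampler (16 + 4, matching the caption); `unet_step` and `align_fn` are placeholders for the real denoising step and a latent-alignment helper such as `align_latent` sketched earlier.

```python
# Restrict latent alignment to a subset of denoising steps (20 steps total).
NUM_STEPS = 20
FIRST_16 = set(range(0, 16))     # early, high-noise steps
LAST_4 = set(range(16, 20))      # late, low-noise steps

def denoise_video(latents, flows, masks, unet_step, align_fn, align_steps=FIRST_16):
    """latents: list of per-frame latents; flows[i]/masks[i]: optical flow and
    visibility mask from frame i-1 to frame i. `unet_step(z, t)` stands in for
    one denoising step of the sampler."""
    for t in range(NUM_STEPS):
        updated = []
        for i, z in enumerate(latents):
            z = unet_step(z, t)
            # Constrain the current latent with the warped previous one
            # only during the chosen steps.
            if t in align_steps and i > 0:
                z = align_fn(updated[i - 1], z, flows[i], masks[i])
            updated.append(z)
        latents = updated
    return latents
```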
Ablation on combining LatentWarp with the vanilla Stable Diffusion model, i.e., without ControlNet ([6]).
| Original video | Stable Diffusion | Stable Diffusion + LatentWarp |
|---|---|---|
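For context, the difference between the two columns comes down to which backbone is loaded. The sketch below shows both setups with the diffusers library; the specific checkpoints (`runwayml/stable-diffusion-v1-5`, `lllyasviel/sd-controlnet-canny`) are assumptions for illustration, not necessarily the ones used in the paper.

```python
import torch
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    StableDiffusionPipeline,
)

# Vanilla Stable Diffusion backbone (no structural conditioning).
sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Stable Diffusion + ControlNet backbone; the Canny-edge checkpoint is an
# illustrative choice, not necessarily the conditioning used in the paper.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
controlnet_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
```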
[1] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
[2] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
[3] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.
[4] S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia. Video-P2P: Video editing with cross-attention control. arXiv preprint arXiv:2303.04, 2023.
[5] Michal Geyer et al. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[6] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[7] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation, 2023.
[8] Haofei Xu et al. GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[9] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), 2020.