LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Supplementary Material

 

 


Our Results

We present sample results of our method.

Input video "A husky" "A bronze sculpture" "A pig"
Input video "An ice sculpture of a car" "A sand sculpture of a car" "A pink car"
Input video "Van-Gogh style, a bus" "A bus on the road, snowy day"
Input video "Van Gogh Starry Night style, a bear" "Soft painting, art by hidari and krenz cushart, a bear"
Input video "Van Gogh Starry Night style, a swan" "A white swan, in the ice and snow"
Input video "Van Gogh style, flamingo" "Flamingo in the space"
Input video "Van Gogh Starry Night style, gold-fish" "Dark sea, fishes"

 


Comparisons to Baselines

Existing text-guided video editing methods suffer from temporal inconsistency.

Our method preserves temporal consistency after translation while remaining faithful to the target text.

"Van Gogh Starry Night style, a bear" Ours Text-to-video ([1]) TAV ([2])
Gen1 ([3]) Video-P2P ([4]) TokenFlow ([5]) Re-render a Video ([7])

"Oil painting style, a dog" Ours Text-to-video ([1]) TAV ([2])
Gen1 ([3]) Video-P2P ([4]) TokenFlow ([5]) Re-render a Video ([7])

"A marble sculpture of a woman running" Ours Text-to-video ([1]) TAV ([2])
Gen1 ([3]) Video-P2P ([4]) TokenFlow ([5]) Re-render a Video ([7])

 


Ablations

Ablation studies on the respective effects of latent alignment, which constrains the query tokens, and cross-frame attention, which constrains the key and value tokens.


[Figure: From left to right: original video | Ours w/o cross-frame attention | Ours w/o latent alignment | Ours.]
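
To make the two components concrete, the following is a minimal PyTorch sketch of cross-frame attention, in which each frame keeps its own query tokens (the part that latent alignment constrains) while the key and value tokens come from a reference frame. The function and tensor shapes are our own illustrative assumptions, not the exact implementation.

    # Illustrative sketch of cross-frame attention: the current frame's
    # queries attend to a reference frame's keys/values, tying appearance
    # across frames.
    import torch
    import torch.nn.functional as F

    def cross_frame_attention(q, k_ref, v_ref, num_heads=8):
        # q:            (batch, tokens, dim) queries of the current frame
        # k_ref, v_ref: (batch, tokens, dim) keys/values of the reference frame
        b, n, d = q.shape
        h = num_heads
        q = q.view(b, n, h, d // h).transpose(1, 2)    # (b, h, n, d/h)
        k = k_ref.view(b, n, h, d // h).transpose(1, 2)
        v = v_ref.view(b, n, h, d // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(qk^T/sqrt(d)) v
        return out.transpose(1, 2).reshape(b, n, d)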

Ablation results with different values of the hyperparameters α and threshold, and with different optical-flow extraction networks.


[Figure: α sweep over 4–6: α=4 | α=5 (ours) | α=6.]
[Figure: Threshold sweep over 0.5–0.7: threshold=0.5 | threshold=0.6 (ours) | threshold=0.7.]
[Figure: Optical-flow network: LatentWarp with GMFlow [8] | LatentWarp with RAFT [9].]
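
For reference, the sketch below shows how a latent from the previous frame can be warped to the current frame with a backward optical flow (extracted by, e.g., GMFlow [8] or RAFT [9]), and how a forward-backward consistency check gated by the threshold can mask out unreliable regions such as occlusions. The helper names and the exact use of the threshold are illustrative assumptions; the precise roles of α and the threshold follow the main paper.

    # Illustrative sketch (assumed helpers, not the exact implementation).
    import torch
    import torch.nn.functional as F

    def backward_warp(x, flow):
        # x: (B, C, H, W); flow: (B, 2, H, W), backward flow in pixels.
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=x.device, dtype=x.dtype),
            torch.arange(w, device=x.device, dtype=x.dtype),
            indexing="ij",
        )
        gx = xs.unsqueeze(0) + flow[:, 0]              # sampling x-coordinates
        gy = ys.unsqueeze(0) + flow[:, 1]              # sampling y-coordinates
        grid = torch.stack(                            # normalize to [-1, 1]
            (2 * gx / (w - 1) - 1, 2 * gy / (h - 1) - 1), dim=-1
        )
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

    def reliability_mask(flow_fwd, flow_bwd, threshold=0.6):
        # A pixel is reliable when forward and backward flows agree,
        # i.e. |flow_fwd + warp(flow_bwd)| is below the threshold.
        err = (flow_fwd + backward_warp(flow_bwd, flow_fwd)).norm(dim=1, keepdim=True)
        return (err < threshold).float()               # 1 = trust the warped latent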

Ablation on the denoising steps at which latent alignment is performed: the first 16 steps versus the last 4 steps.


[Figure: From left to right: original video | latent alignment in the last 4 steps | latent alignment in the first 16 steps.]
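
As a minimal sketch, latent alignment can be gated by denoising step as below, assuming a 20-step sampler; denoise_step and align_latents are hypothetical placeholders for one sampler update and for the warp-based alignment.

    # Hypothetical sketch: apply latent alignment only during the first 16
    # denoising steps, skipping the last 4 (assuming len(timesteps) == 20).
    align_until = 16
    for i, t in enumerate(timesteps):                  # scheduler timesteps
        latents = denoise_step(latents, t)             # placeholder: one sampler update
        if i < align_until:                            # first 16 steps only
            latents = align_latents(latents, flows)    # placeholder: warp-based alignment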

Ablation on the effect of combining LatentWarp with the vanilla Stable Diffusion model, i.e., without ControlNet [6].


[Figure: From left to right: original video | Stable Diffusion | Stable Diffusion + LatentWarp.]
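
For orientation, the following diffusers sketch loads the two backbones compared in this ablation; the model IDs are common public checkpoints chosen for illustration, and hooking LatentWarp into the sampling loop is abstracted away.

    # Illustrative only: the two backbones compared in this ablation.
    import torch
    from diffusers import (
        ControlNetModel,
        StableDiffusionControlNetPipeline,
        StableDiffusionPipeline,
    )

    # Vanilla Stable Diffusion (no structural conditioning).
    sd = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Stable Diffusion + ControlNet [6], e.g. depth-conditioned.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
    )
    sd_controlnet = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")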


 

 

 

References

[1] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.

[2] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.

[3] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011, 2023.

[4] S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia. Video-P2P: Video editing with cross-attention control. arXiv preprint arXiv:2303.04, 2023.

[5] Michal Geyer et al. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

[6] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.

[7] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender A Video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.

[8] Haofei Xu et al. GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[9] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), 2020.