Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
TV2TV is a video generation framework that interleaves text and video generation, using a Mixture-of-Transformers architecture to improve visual quality and controllability.
The paper introduces TV2TV, an approach that interleaves text and video generation in a single process. This lets the model "think in words" before producing video frames, improving both the quality of the generated video and control over its content. The Mixture-of-Transformers backbone decides when to switch between generating text and video, making it easier to align the output with the intended prompt. The model shows significant improvements on complex video sequences, particularly in video game and sports scenarios.
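A minimal sketch of what such interleaved decoding could look like, assuming the model emits a control token that selects the next modality at each step; all names here (`decode_interleaved`, `next_control_token`, `ToyModel`, etc.) are illustrative stand-ins, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical control tokens marking which modality comes next.
TEXT, VIDEO, EOS = "<text>", "<video>", "<eos>"

@dataclass
class InterleavedOutput:
    text_spans: list = field(default_factory=list)   # plans written "in words"
    video_clips: list = field(default_factory=list)  # frame latents per clip

def decode_interleaved(model, prompt, max_steps=256):
    """Alternate between text and video generation as the model chooses."""
    out = InterleavedOutput()
    state = model.encode(prompt)
    for _ in range(max_steps):
        control = model.next_control_token(state)    # model picks the modality
        if control == EOS:
            break
        if control == TEXT:
            span, state = model.generate_text_span(state)
            out.text_spans.append(span)              # reason about the next shot
        else:  # VIDEO
            clip, state = model.generate_video_clip(state)
            out.video_clips.append(clip)             # render frames for that plan
    return out

class ToyModel:
    """Stand-in model that alternates modalities, for demonstration only."""
    def encode(self, prompt):
        return {"prompt": prompt, "step": 0}
    def next_control_token(self, state):
        state["step"] += 1
        return EOS if state["step"] > 4 else (TEXT if state["step"] % 2 else VIDEO)
    def generate_text_span(self, state):
        return f"shot plan {state['step']}", state
    def generate_video_clip(self, state):
        return [f"latent-{state['step']}-{i}" for i in range(3)], state

result = decode_interleaved(ToyModel(), "a last-minute goal in a soccer match")
print(result.text_spans)        # -> ['shot plan 1', 'shot plan 3']
print(len(result.video_clips))  # -> 2
```

Letting the model choose its own switch points, rather than following a fixed text-then-video schedule, is what allows it to plan in words immediately before each clip it renders.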