TV2TV: A Unified Framework for Interleaved Language and Video Generation

arXiv Source

Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan

cs.AI | Dec 4, 2025

One-line Summary

TV2TV is a framework that interleaves text and video generation, using a Mixture-of-Transformers architecture to improve the visual quality and prompt controllability of generated videos.

Plain-language Overview

The paper introduces TV2TV, an approach that unifies text and video generation in a single process. The model can "think in words" before rendering video frames, which improves both the visual quality of the output and how closely it follows the prompt. TV2TV's architecture decides on the fly when to switch between generating text and generating video, making it easier to keep the video aligned with the intended prompt. The authors report notable gains on complex video sequences, particularly in video game and sports scenarios. A minimal sketch of this interleaved decoding loop follows.
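To make the interleaving concrete, here is a minimal Python sketch of the kind of decoding loop the overview describes: the model emits one shared token stream, and control tokens switch which per-modality expert handles the next tokens, in the spirit of a Mixture-of-Transformers. All names here (`InterleavedDecoder`, `<begin_video>`, `<begin_text>`) are illustrative assumptions, not TV2TV's actual API or token vocabulary.

```python
# Illustrative sketch only: the real TV2TV architecture, experts, and
# control tokens are not specified in this summary.
from dataclasses import dataclass, field
from typing import List

BOV = "<begin_video>"  # hypothetical control token: route to the video expert
BOT = "<begin_text>"   # hypothetical control token: route to the text expert
EOS = "<eos>"

@dataclass
class InterleavedDecoder:
    """Routes each decoding step to a per-modality expert,
    Mixture-of-Transformers style, over one shared sequence."""
    sequence: List[str] = field(default_factory=list)
    modality: str = "text"

    def step(self, token: str) -> None:
        # Control tokens flip the active expert; every token is then
        # appended to the single interleaved sequence.
        if token == BOV:
            self.modality = "video"
        elif token == BOT:
            self.modality = "text"
        self.sequence.append(f"[{self.modality}] {token}")

decoder = InterleavedDecoder()
# The model "thinks in words" (text tokens) before emitting video frames.
for tok in ["The", "ball", "arcs", "left", BOV, "frame_0", "frame_1",
            BOT, "It", "lands", EOS]:
    decoder.step(tok)
print("\n".join(decoder.sequence))
```

The key design idea this toy loop captures is that the model itself, not an external scheduler, decides when to stop narrating in text and start emitting video tokens, which is what lets the generated frames stay tied to the written plan.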

Technical Details