Endora: Video Generation Models as Endoscopy Simulators

arXiv 2024

Chenxin Li¹, Hengyu Liu¹, Yifan Liu¹*, Brandon Y. Feng²,
Wuyang Li¹, Xinyu Liu¹, Zhen Chen³, Jing Shao⁴, Yixuan Yuan¹

¹The Chinese University of Hong Kong ²Massachusetts Institute of Technology
³Centre for Artificial Intelligence and Robotics of Hong Kong ⁴Shanghai Artificial Intelligence Laboratory

Abstract

TL;DR: Endora enables high-fidelity medical video generation for endoscopy scenes and demonstrates its versatility through successful applications in video-based disease diagnosis and 3D surgical scene reconstruction.

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped.

This paper introduces Endora, an approach for generating medical videos that simulate clinical endoscopy scenes. We present a novel generative model design that integrates a spatial-temporal video transformer with priors from advanced 2D vision foundation models, explicitly modeling spatial-temporal dynamics during video generation.
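To make the idea of spatial-temporal modeling concrete, here is a minimal sketch of one common way to factorize attention over video tokens: attention across patches within each frame, followed by attention across frames at each patch location. This is an illustration under stated assumptions, not Endora's actual architecture. The function names, the tensor layout `(frames, patches, dim)`, and the omission of learned projections and foundation-model priors are all simplifications introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned projections (an assumption
    made to keep the sketch short). tokens: (batch, length, dim); information
    is mixed along the `length` axis only."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def spatial_block(video_tokens):
    # Layout (T, N, D): each frame is a batch item, so patches only
    # attend to other patches in the same frame.
    return video_tokens + self_attention(video_tokens)

def temporal_block(video_tokens):
    # Swap to (N, T, D): each patch location is a batch item, so a patch
    # attends to the same location across all frames.
    x = video_tokens.transpose(1, 0, 2)
    x = x + self_attention(x)
    return x.transpose(1, 0, 2)

def spatial_temporal_block(video_tokens):
    # Hypothetical composition: spatial mixing first, then temporal mixing.
    return temporal_block(spatial_block(video_tokens))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16, 32))  # 8 frames, 16 patches, 32 dims
out = spatial_temporal_block(tokens)
print(out.shape)  # (8, 16, 32)
```

In this sketch the spatial block never mixes information between frames; only the temporal block does. That separation is what makes the temporal dynamics an explicit part of the model rather than a side effect of processing each frame independently.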

We also establish the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods to this task. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing. Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even generate 3D medical scenes with multi-view consistency.

In a nutshell, Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting the stage for further advances in medical content generation.