I am a Principal AI Scientist at Zoom AI, where I focus on Agentic Memory and LLM Post-training.
Currently, I am passionate about building a next-generation platform where humans and agents collaborate seamlessly, from incubating ideas to enterprise-scale automation.
Before joining Zoom, I was at Snap Research and Microsoft Azure AI, where I worked on multimodal generation, video synthesis, and knowledge-grounded language models.
studyfang AT gmail.com | Redmond, Washington
Google Scholar / LinkedIn / GitHub / Twitter
ACL 2024
LoCoMo: A comprehensive benchmark for evaluating long-term conversational memory capabilities of large language model agents across extended dialogues.
CVPR 2024
A large-scale dataset of 70 million video clips with high-quality captions generated using multiple cross-modality teacher models for video-language understanding.
CVPR 2024
A scalable transformer architecture for high-quality text-to-video generation, enabling controllable and coherent video synthesis from textual descriptions.
EMNLP 2024
A framework for controllable video generation that grounds the generation process in multi-modal instructions, enabling precise spatial and temporal control.
AAAI 2023
A unified framework for integrative multimodal learning across vision, language, and speech modalities with composable pre-training objectives.
EMNLP 2020
A hierarchical graph neural network approach to multi-hop reasoning for question answering, achieving state-of-the-art results on HotpotQA.