About

Enxin Song is currently a master's student at Zhejiang University. She is a research intern at New York University, working with Prof. Saining Xie, and previously conducted research as a visiting intern at the University of California, San Diego (UCSD) under the supervision of Prof. Zhuowen Tu. She will receive her M.S. degree in March 2026 from Zhejiang University, where she is advised by Prof. Gaoang Wang (CVNext Lab), and holds a B.S. in Software Engineering from Dalian University of Technology. Her work centers on video understanding, highlighted by MovieChat, the first Large Multi-Modal Model for hour-long video understanding. She has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025.

She is a highly self-motivated student applying to Ph.D. programs for Fall 2026. You can view her Curriculum Vitae.

Experiences

Feb. 2025 – Present

University of California, San Diego (UCSD), USA

Visiting Intern

Advised by Prof. Zhuowen Tu.

Nov. 2023 – May 2024

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Worked on text-to-image generation.

Education

Sep. 2023 – Mar. 2026 (expected)

M.S., Zhejiang University, Hangzhou, China

Artificial Intelligence

Ranked 1/82 in the M.S. program.

Sep. 2019 – Jun. 2023

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering

Ranked 21/385 in the undergraduate cohort.

News

Nov 2025 Selected (top 10%) to give a talk at the KAUST Rising Stars in AI Symposium 2026.
Oct 2025 Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop, along with a travel grant.
Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
Sep 2025 Invited talk at Lambda AI titled From Seeing to Thinking.
Sep 2025 One paper accepted by ICCV 2025 KnowledgeMR Workshop.
Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering has been accepted by IEEE TPAMI.

Selected Publications and Manuscripts

* Equal contribution.

Also see Google Scholar.

VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song*, Wenhao Chai, Shusheng Yang, Ethan J. Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

Preprint, 2025

VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems.

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu*, Enxin Song*, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang

ICCV, 2025

Paper

AuroraLong uses a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states to solve long video understanding tasks.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai*, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning

ICLR, 2025

Paper Website Model Benchmark Code

AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

CVPR, 2024

Paper Website Benchmark Code

MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.

Professional Service

Conference & Journal Refereeing

NeurIPS 2025, PRCV 2023 & 2025, CVPR 2025, ICLR 2025 & 2026, TMM 2024, TPAMI 2025.

Workshop Organization

Teaching Assistant

Spring 2024

ECE 445 Senior Design (Undergraduate)

Teaching Assistant with Prof. Gaoang Wang.

Selected Honors & Awards

2026
KAUST Rising Stars in AI Symposium
2025
Lambda AI Cloud Credits Grant Sponsorship
2025
National Scholarship, Zhejiang University
2024
National Scholarship, Zhejiang University
2021
National Scholarship, Dalian University of Technology
