Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation.
While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring.
Supervised fine-tuning of Qwen2.5-VL-7B on MedVidBench yields a model that substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, and MedGRPO further improves on this SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and a robust training methodology for advancing vision-language models in medical domains.
531,850 video-instruction pairs across 8 medical sources and 8 tasks. Multi-perspective quality assurance pipeline with expert-guided prompting and dual-model validation (GPT-4.1 + Gemini-2.5-Flash).
Fine-tuning Qwen2.5-VL-7B on MedVidBench establishes a strong baseline that outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks.
Logistic normalization maps each dataset's median performance to a fixed reward value (0.5), ensuring fair optimization across heterogeneous datasets and preventing training collapse.
GPT-4.1-based comparative evaluation across five clinical dimensions: medical terminology, instrument/anatomy identification, specificity, clinical context, and action accuracy.
Two technical contributions that enable stable multi-dataset RL training
Standard RL fails on heterogeneous medical datasets because difficulty varies widely across sources (e.g., median STG mIoU of ~0.5 on CoPESD vs. ~0.12 on EgoSurgery). We introduce a logistic normalization that maps each dataset's median performance to a fixed reward value (0.5); a sketch follows the benefits below.
Benefits: (1) Median fairness — equal rewards at median performance across all datasets, (2) Smooth gradients — non-zero derivatives everywhere, (3) Outlier robustness — IQR-based scaling, (4) Bounded output — compatible with GRPO group normalization.
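A minimal Python sketch of such an IQR-scaled logistic mapping. Only the median-to-0.5 mapping, IQR-based scaling, smooth gradients, and bounded output are stated above; the exact functional form, the steepness constant `k`, and how per-dataset statistics are estimated are assumptions for illustration.

```python
import numpy as np

def logistic_reward(score, dataset_scores, k=1.0):
    """Map a raw metric score to a bounded reward in (0, 1).

    The dataset's median score maps to exactly 0.5, and the spread is
    scaled by the interquartile range (IQR) for outlier robustness.
    The steepness `k` is a hypothetical knob, not specified in the text.
    """
    median = np.median(dataset_scores)
    q1, q3 = np.percentile(dataset_scores, [25, 75])
    iqr = max(q3 - q1, 1e-6)  # guard against a degenerate spread
    # Logistic curve: smooth, non-zero gradient everywhere, bounded in (0, 1)
    return 1.0 / (1.0 + np.exp(-k * (score - median) / iqr))

# Easy vs. hard datasets (cf. CoPESD STG ~0.5 vs. EgoSurgery STG ~0.12):
easy = [0.35, 0.45, 0.50, 0.55, 0.65]
hard = [0.05, 0.10, 0.12, 0.15, 0.20]
print(logistic_reward(0.50, easy))  # 0.5 at the easy dataset's median
print(logistic_reward(0.12, hard))  # 0.5 at the hard dataset's median
```

At each dataset's median the reward is exactly 0.5, so GRPO groups see comparable reward scales regardless of dataset difficulty.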
Standard semantic-similarity metrics fail to capture clinical correctness. For example, "The tool grasps tissue in the upper area" and "The grasper dissects the cystic duct in the upper right quadrant" achieve high similarity (~0.82) yet differ critically in medical accuracy.
Our GPT-4.1-based judge uses comparative similarity scoring across five clinical dimensions: medical terminology, instrument/anatomy identification, specificity, clinical context, and action accuracy.
Hybrid design: Combine LLM judge (50%) with semantic similarity (50%) for robust evaluation capturing both clinical correctness and overall coherence.
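A minimal Python sketch of the hybrid reward, assuming hypothetical `llm_judge_score` and `semantic_similarity` callables that each return a value in [0, 1] (the actual judge prompt and embedding model are not specified here):

```python
def hybrid_caption_reward(prediction: str, reference: str,
                          llm_judge_score, semantic_similarity) -> float:
    """Blend a clinical LLM-judge score with generic semantic similarity.

    The 50/50 weighting follows the hybrid design above; the two scoring
    callables (judge over the five clinical dimensions, embedding-based
    similarity) are hypothetical stand-ins for the actual pipeline.
    """
    judge = llm_judge_score(prediction, reference)          # clinical correctness
    coherence = semantic_similarity(prediction, reference)  # overall coherence
    return 0.5 * judge + 0.5 * coherence
```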
A comprehensive benchmark for medical video understanding
MedVidBench systematically transforms existing expert annotations—bounding boxes, procedure transcripts, action labels—into instruction-following format through a multi-perspective quality assurance pipeline with dual-model validation (GPT-4.1 + Gemini-2.5-Flash).
Two versions: Large-Scale (531K samples, all available data) for scaling experiments, and Standard (51K samples, task-balanced) for efficient multi-task learning.
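As a rough illustration of the dual-model validation step, a generated instruction pair might be retained only when both validator models accept it. The `validate_with` callable and the all-must-agree criterion below are assumptions for illustration, not the pipeline's exact logic:

```python
def dual_model_filter(qa_pairs, validate_with):
    """Keep only QA pairs that both validator models accept.

    `validate_with(model_name, pair)` is a hypothetical callable that asks
    the named model to check a pair against its source annotation and
    returns True/False; the real prompts and thresholds may differ.
    """
    validators = ["gpt-4.1", "gemini-2.5-flash"]
    return [pair for pair in qa_pairs
            if all(validate_with(model, pair) for model in validators)]
```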
Source domains: ■ Laparoscopic ■ Open Surgery ■ Robotic Surgery ■ Nursing
Distribution across 532K QA instances: (Left) Answer length from 1 to 1,170 words — short answers (≤5 words, 28.1%) from temporal grounding, long answers (>20 words, 51.8%) from captioning tasks. (Middle) Video durations from 20s to 1,800s with a long-tail pattern. (Right) FPS distribution — 0.5 FPS (63.3%), 1.0 FPS (22.0%), 2.0 FPS (7.5%).
MedGRPO achieves consistent gains over SFT baseline across all tasks
| Model | CVS (Acc) | NAP (Acc) | SA (Acc) | STG (mIoU) | TAG@0.3 | TAG@0.5 | DVC (LLM) | DVC (F1) | VS (LLM) | RC (LLM) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 2.438 | 0.101 | 2.490 | 2.080 |
| Gemini-2.5-Flash | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 2.387 | 0.084 | 2.352 | 1.912 |
| VideoChat-R1.5-7B | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 1.723 | 0.026 | 3.034 | 3.086 |
| Qwen2.5VL-7B | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 2.512 | 0.075 | 2.452 | 2.090 |
| Qwen2.5VL-7B-Surg-CholecT50 (NVIDIA) | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 1.945 | 0.051 | 2.101 | 2.986 |
| Qwen2.5VL-7B-SFT (Ours) | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 3.665 | 0.165 | 3.596 | 2.757 |
| Qwen2.5VL-7B-MedGRPO (Ours) | 0.914 | 0.427 | 0.244 | 0.202 | 0.216 | 0.156 | 3.797 | 0.210 | 4.184 | 3.442 |
| Qwen3VL-4B | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 1.939 | 0.128 | 2.926 | 2.853 |
| Qwen3VL-4B-SFT (Ours) | 0.895 | 0.466 | 0.270 | 0.170 | 0.465 | 0.403 | 3.862 | 0.435 | 4.180 | 3.752 |
| Qwen3VL-4B-MedGRPO (Ours) | 0.898 | 0.473 | 0.285 | 0.212 | 0.504 | 0.441 | 3.950 | 0.491 | 4.227 | 3.861 |
■ Best ■ 2nd Best. Metrics: Accuracy (CVS/NAP/SA) · mIoU (STG/TAG) · LLM Judge (DVC/VS/RC) · F1 (DVC)
MedGRPO generates clinically accurate descriptions across diverse tasks
Compared with GPT-4.1, Gemini-2.5-Flash, and the SFT baseline, MedGRPO produces more precise instrument identification, more accurate spatial localization, and better clinical context understanding.
Ground Truth: "The grasper consistently grips and retracts the gallbladder towards the top left of the surgical field, providing counter-traction and exposure."
GPT-4.1: Generic descriptions without specific instruments.
Gemini-2.5-Flash: Misidentifies tool as "electrocautery hook" with incorrect actions.
SFT Baseline: Identifies "grasper" but uses vague spatial terms ("right side").
MedGRPO (Ours): "The grasper, positioned primarily on the upper left, steadily holds and maintains exposure of the surgical field, retracting the gallbladder to facilitate dissection."
Comparison of dense captioning quality across models, showing improved temporal coverage and procedural detail with MedGRPO.
MedGRPO generates more clinically coherent and terminology-precise summaries compared to off-the-shelf and SFT baselines.
Analysis of failure modes — understanding where MedGRPO still struggles informs future research directions.
@inproceedings{su2026medgrpo,
title={{MedGRPO}: Multi-Task Reinforcement Learning for
Heterogeneous Medical Video Understanding},
author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and
Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and
Shen, Yuhan and Innanje, Arun and Chen, Terrence and
Elhamifar, Ehsan and Wu, Ziyan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR)},
year={2026}
}