CVPR 2026

MedGRPO: Multi-Task Reinforcement Learning
for Heterogeneous Medical Video Understanding

1Northeastern University    2United Imaging Intelligence
MedVidBench Dataset Pipeline

Abstract

Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation.

While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring.

Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains.

Method Overview

MedVidBench

531,850 video-instruction pairs across 8 medical sources and 8 tasks. Multi-perspective quality assurance pipeline with expert-guided prompting and dual-model validation (GPT-4.1 + Gemini-2.5-Flash).

Supervised Fine-Tuning

Fine-tuning Qwen2.5-VL-7B on MedVidBench establishes a strong baseline that outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks.

Reward Normalization

Logistic normalization maps each dataset's median performance to a fixed reward value (0.5), ensuring fair optimization across heterogeneous datasets and preventing training collapse.

Medical LLM Judge

GPT-4.1-based comparative evaluation across five clinical dimensions: medical terminology, instrument/anatomy ID, specificity, clinical context, and action accuracy.

MedGRPO Pipeline

Key Innovations

Two technical contributions that enable stable multi-dataset RL training

Cross-Dataset Reward Normalization

Standard RL fails on heterogeneous medical datasets due to vastly different difficulty levels (e.g., CoPESD STG median mIoU ~0.5 vs EgoSurgery STG ~0.12). We introduce logistic normalization that maps each dataset's median performance to a fixed reward value (0.5):

r_norm(x) = 1 / (1 + exp(-k · (x - p50) / IQR))

where x is the raw metric, p50 and IQR are the dataset's median and interquartile range of that metric, and k is a steepness parameter.

Benefits: (1) Median fairness — equal rewards at median performance across all datasets, (2) Smooth gradients — non-zero derivatives everywhere, (3) Outlier robustness — IQR-based scaling, (4) Bounded output — compatible with GRPO group normalization.
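As a minimal sketch of this normalization (the function name, the default k, and the example IQR values are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def normalize_reward(raw: float, p50: float, iqr: float, k: float = 2.0) -> float:
    """Logistic reward normalization: maps the dataset median (p50) to 0.5.

    p50 and iqr are the per-dataset median and interquartile range of the
    raw metric (e.g. mIoU); k is a steepness parameter (value assumed here).
    """
    return 1.0 / (1.0 + np.exp(-k * (raw - p50) / iqr))

# Both datasets earn reward 0.5 at their own median, despite a ~4x gap in
# raw difficulty (IQR values below are placeholders for illustration):
print(normalize_reward(0.50, p50=0.50, iqr=0.20))  # CoPESD STG median -> 0.5
print(normalize_reward(0.12, p50=0.12, iqr=0.05))  # EgoSurgery STG median -> 0.5
```

Because the output is bounded in (0, 1), the normalized rewards remain compatible with GRPO's group-wise mean/std advantage normalization.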

Medical LLM Judge

Standard semantic similarity fails to capture clinical correctness. For example, "The tool grasps tissue in the upper area" vs "The grasper dissects the cystic duct in the upper right quadrant" achieve high similarity (~0.82) but differ critically in medical accuracy.

Our GPT-4.1-based judge uses comparative similarity scoring across five clinical dimensions:

  • Medical terminology precision — Clinical terms vs lay language
  • Instrument & anatomy identification — Specific tools and structures
  • Specificity vs vagueness — Precise details vs generic descriptions
  • Clinical procedure context — Workflow and safety awareness
  • Action & state accuracy — Surgical actions and tissue states

Hybrid design: Combine LLM judge (50%) with semantic similarity (50%) for robust evaluation capturing both clinical correctness and overall coherence.
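Assuming each judge dimension returns a score in [0, 1] (the dimension keys and uniform averaging below are illustrative, not the paper's exact scheme), the 50/50 hybrid reward can be sketched as:

```python
# Five clinical dimensions scored by the LLM judge (keys are illustrative).
JUDGE_DIMENSIONS = (
    "medical_terminology",
    "instrument_anatomy_identification",
    "specificity",
    "clinical_context",
    "action_state_accuracy",
)

def hybrid_caption_reward(judge_scores: dict, semantic_sim: float,
                          w_judge: float = 0.5) -> float:
    """Blend the mean per-dimension judge score with semantic similarity."""
    judge = sum(judge_scores[d] for d in JUDGE_DIMENSIONS) / len(JUDGE_DIMENSIONS)
    return w_judge * judge + (1.0 - w_judge) * semantic_sim

# A clinically vague caption can score high on semantic similarity (~0.82)
# yet low with the judge, pulling the final reward well below similarity alone:
vague = {d: 0.3 for d in JUDGE_DIMENSIONS}
print(hybrid_caption_reward(vague, semantic_sim=0.82))
```

The equal weighting lets semantic similarity reward overall coherence while the judge penalizes captions that are fluent but clinically wrong.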

MedVidBench

A comprehensive benchmark for medical video understanding

MedVidBench systematically transforms existing expert annotations—bounding boxes, procedure transcripts, action labels—into instruction-following format through a multi-perspective quality assurance pipeline with dual-model validation (GPT-4.1 + Gemini-2.5-Flash).

  • 531,850 video-instruction pairs (Large-Scale version)
  • 51,505 task-balanced pairs (Standard version)
  • 8 medical data sources (626 unique videos)
  • 8 task types across 3 granularity levels

Two versions: Large-Scale (531K samples, maximum data) for scaling experiments, and Standard (51K samples, task-balanced) for efficient multi-task learning.

Temporal Action Grounding (TAG)
Spatiotemporal Grounding (STG)
Video Summarization (VS)
Region Captioning (RC)
Dense Video Captioning (DVC)
Next Action Prediction (NAP)
Skill Assessment (SA)
Critical View of Safety (CVS)

Data Sources

CholecT50
CholecTrack20
Cholec80-CVS
CoPESD
AVOS
EgoSurgery
JIGSAWS
NurViD

Laparoscopic    Open Surgery    Robotic Surgery    Nursing

Dataset Distribution Analysis

Distribution across 532K QA instances: (Left) Answer length from 1 to 1,170 words — short answers (≤5 words, 28.1%) from temporal grounding, long answers (>20 words, 51.8%) from captioning tasks. (Middle) Video durations from 20s to 1,800s with a long-tail pattern. (Right) FPS distribution — 0.5 FPS (63.3%), 1.0 FPS (22.0%), 2.0 FPS (7.5%), with the remainder at other rates.


Results

MedGRPO achieves consistent gains over SFT baseline across all tasks

Key Findings

  • SFT outperforms closed-source models: Our SFT baseline substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks (e.g., CVS: 0.894 vs 0.018/0.101, STG: 0.177 vs 0.014/0.047).
  • MedGRPO improves upon SFT: Consistent gains across most tasks, e.g., CVS (+0.020 to 0.914), STG (+0.025 to 0.202), TAG@0.3 (+0.074 to 0.216), VS (+0.588 to 4.184), RC (+0.685 to 3.442).
  • Caption tasks benefit most: LLM judge-based evaluation drives substantial gains in Video Summarization (+16.4%) and Region Captioning (+24.8%).
  • Reward normalization prevents collapse: Without normalization, training collapses with performance dropping dramatically (CVS: 0.894→0.020, STG: 0.177→0.010).
  • Multi-task learning synergy: Training with caption tasks improves grounding performance (STG +4.7%, TAG@0.3 +6.9%, TAG@0.5 +9.9%).

Model                                  CVS(acc)  NAP(acc)  SA(acc)  STG(mIoU)  TAG@0.3  TAG@0.5  DVC(LLM)  DVC(F1)  VS(LLM)  RC(LLM)
GPT-4.1                                0.018     0.250     0.087    0.014      0.096    0.005    2.438     0.101    2.490    2.080
Gemini-2.5-Flash                       0.101     0.228     0.107    0.047      0.045    0.021    2.387     0.084    2.352    1.912
VideoChat-R1.5-7B                      0.000     0.270     0.006    0.000      0.009    0.005    1.723     0.026    3.034    3.086
Qwen2.5VL-7B                           0.105     0.151     0.010    0.020      0.006    0.068    2.512     0.075    2.452    2.090
Qwen2.5VL-7B Surg-CholecT50 (NVIDIA)   0.000     0.302     0.000    0.000      0.019    0.013    1.945     0.051    2.101    2.986
Qwen2.5VL-7B SFT (Ours)                0.894     0.442     0.218    0.177      0.142    0.091    3.665     0.165    3.596    2.757
Qwen2.5VL-7B MedGRPO (Ours)            0.914     0.427     0.244    0.202      0.216    0.156    3.797     0.210    4.184    3.442
Qwen3VL-4B                             0.000     0.178     0.006    0.000      0.039    0.034    1.939     0.128    2.926    2.853
Qwen3VL-4B SFT (Ours)                  0.895     0.466     0.270    0.170      0.465    0.403    3.862     0.435    4.180    3.752
Qwen3VL-4B MedGRPO (Ours)              0.898     0.473     0.285    0.212      0.504    0.441    3.950     0.491    4.227    3.861

Metrics: Accuracy (CVS/NAP/SA) · mIoU (STG/TAG) · LLM-judge score (DVC/VS/RC) · F1 (DVC)

Qualitative Results

MedGRPO generates clinically accurate descriptions across diverse tasks

Region Captioning

MedGRPO produces precise instrument identification, accurate spatial localization, and clinical context understanding compared to GPT-4.1, Gemini-2.5-Flash, and the SFT baseline.

Ground Truth: "The grasper consistently grips and retracts the gallbladder towards the top left of the surgical field, providing counter-traction and exposure."

GPT-4.1: Generic descriptions without specific instruments.

Gemini-2.5-Flash: Misidentifies tool as "electrocautery hook" with incorrect actions.

SFT Baseline: Identifies "grasper" but uses vague spatial terms ("right side").

MedGRPO (Ours): "The grasper, positioned primarily on the upper left, steadily holds and maintains exposure of the surgical field, retracting the gallbladder to facilitate dissection."

Qualitative comparison of region captioning

Dense Video Captioning

Comparison of dense captioning quality across models, showing improved temporal coverage and procedural detail with MedGRPO.

Qualitative results for dense video captioning

Video Summarization

MedGRPO generates more clinically coherent and terminology-precise summaries compared to off-the-shelf and SFT baselines.

Qualitative results for video summarization

Failure Cases

Analysis of failure modes — understanding where MedGRPO still struggles informs future research directions.

Failure case analysis

Citation

@inproceedings{su2026medgrpo,
  title={{MedGRPO}: Multi-Task Reinforcement Learning for
         Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and
          Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and
          Shen, Yuhan and Innanje, Arun and Chen, Terrence and
          Elhamifar, Ehsan and Wu, Ziyan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer
             Vision and Pattern Recognition (CVPR)},
  year={2026}
}