Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation.
While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring.
Supervised fine-tuning of Qwen2.5-VL-7B on MedVidBench yields a model that substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, and MedGRPO further improves on this SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and a robust training methodology for advancing vision-language models in medical domains.
531,850 video-instruction pairs across 8 medical sources and 8 tasks. Multi-perspective quality assurance pipeline with expert-guided prompting and dual-model validation (GPT-4.1 + Gemini-2.5-Flash).
Fine-tuning Qwen2.5-VL-7B on MedVidBench establishes a strong baseline that outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks.
Logistic normalization maps each dataset's median performance to a fixed reward value (0.5), ensuring fair optimization across heterogeneous datasets and preventing training collapse.
GPT-4.1-based comparative evaluation across five clinical dimensions: medical terminology, instrument/anatomy identification, specificity, clinical context, and action accuracy.
Two technical contributions that enable stable multi-dataset RL training
Standard RL fails on heterogeneous medical datasets because difficulty varies widely across sources (e.g., median STG mIoU of ~0.5 on CoPESD vs. ~0.12 on EgoSurgery). We introduce a logistic normalization that maps each dataset's median performance to a fixed reward value (0.5); a sketch follows the benefits below.
Benefits: (1) Median fairness — equal rewards at median performance across all datasets, (2) Smooth gradients — non-zero derivatives everywhere, (3) Outlier robustness — IQR-based scaling, (4) Bounded output — compatible with GRPO group normalization.
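A minimal Python sketch of such an IQR-scaled logistic mapping. Only the median-to-0.5 mapping, IQR-based scaling, smooth gradients, and bounded output are stated above; the exact functional form, the steepness constant `k`, and how per-dataset statistics are estimated are assumptions for illustration.

```python
import numpy as np

def logistic_reward(score, dataset_scores, k=1.0):
    """Map a raw metric score to a bounded reward in (0, 1).

    The dataset's median score maps to exactly 0.5, and the spread is
    scaled by the interquartile range (IQR) for outlier robustness.
    The steepness `k` is a hypothetical knob, not specified in the text.
    """
    median = np.median(dataset_scores)
    q1, q3 = np.percentile(dataset_scores, [25, 75])
    iqr = max(q3 - q1, 1e-6)  # guard against a degenerate spread
    # Logistic curve: smooth, non-zero gradient everywhere, bounded in (0, 1)
    return 1.0 / (1.0 + np.exp(-k * (score - median) / iqr))

# Easy vs. hard datasets (cf. CoPESD STG ~0.5 vs. EgoSurgery STG ~0.12):
easy = [0.35, 0.45, 0.50, 0.55, 0.65]
hard = [0.05, 0.10, 0.12, 0.15, 0.20]
print(logistic_reward(0.50, easy))  # 0.5 at the easy dataset's median
print(logistic_reward(0.12, hard))  # 0.5 at the hard dataset's median
```

At each dataset's median the reward is exactly 0.5, so GRPO groups see comparable reward scales regardless of dataset difficulty.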
Standard semantic-similarity metrics fail to capture clinical correctness. For example, "The tool grasps tissue in the upper area" and "The grasper dissects the cystic duct in the upper right quadrant" achieve high similarity (~0.82) yet differ critically in medical accuracy.
Our GPT-4.1-based judge uses comparative similarity scoring across five clinical dimensions: medical terminology, instrument/anatomy identification, specificity, clinical context, and action accuracy.
Hybrid design: Combine LLM judge (50%) with semantic similarity (50%) for robust evaluation capturing both clinical correctness and overall coherence.
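A minimal Python sketch of the hybrid reward, assuming hypothetical `llm_judge_score` and `semantic_similarity` callables that each return a value in [0, 1] (the actual judge prompt and embedding model are not specified here):

```python
def hybrid_caption_reward(prediction: str, reference: str,
                          llm_judge_score, semantic_similarity) -> float:
    """Blend a clinical LLM-judge score with generic semantic similarity.

    The 50/50 weighting follows the hybrid design above; the two scoring
    callables (judge over the five clinical dimensions, embedding-based
    similarity) are hypothetical stand-ins for the actual pipeline.
    """
    judge = llm_judge_score(prediction, reference)          # clinical correctness
    coherence = semantic_similarity(prediction, reference)  # overall coherence
    return 0.5 * judge + 0.5 * coherence
```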
A comprehensive benchmark for medical video understanding
MedVidBench systematically transforms existing expert annotations—bounding boxes, procedure transcripts, action labels—into instruction-following format through a multi-perspective quality assurance pipeline with dual-model validation (GPT-4.1 + Gemini-2.5-Flash).
Two versions: Large-Scale (531K samples, all available data) for scaling experiments, and Standard (51K samples, task-balanced) for efficient multi-task learning.
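As a rough illustration of the dual-model validation step, a generated instruction pair might be retained only when both validator models accept it. The `validate_with` callable and the all-must-agree criterion below are assumptions for illustration, not the pipeline's exact logic:

```python
def dual_model_filter(qa_pairs, validate_with):
    """Keep only QA pairs that both validator models accept.

    `validate_with(model_name, pair)` is a hypothetical callable that asks
    the named model to check a pair against its source annotation and
    returns True/False; the real prompts and thresholds may differ.
    """
    validators = ["gpt-4.1", "gemini-2.5-flash"]
    return [pair for pair in qa_pairs
            if all(validate_with(model, pair) for model in validators)]
```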
Source domains: ■ Laparoscopic ■ Open Surgery ■ Robotic Surgery ■ Nursing
Distribution across 532K QA instances: (Left) Answer length from 1 to 1,170 words — short answers (≤5 words, 28.1%) from temporal grounding, long answers (>20 words, 51.8%) from captioning tasks. (Middle) Video durations from 20s to 1,800s with a long-tail pattern. (Right) FPS distribution — 0.5 FPS (63.3%), 1.0 FPS (22.0%), 2.0 FPS (7.5%).
MedGRPO achieves consistent gains over SFT baseline across all tasks
| Model | CVS (Acc) | NAP (Acc) | SA (Acc) | STG (mIoU) | TAG@0.3 | TAG@0.5 | DVC (LLM) | DVC (F1) | VS (LLM) | RC (LLM) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 2.438 | 0.101 | 2.490 | 2.080 |
| Gemini-2.5-Flash | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 2.387 | 0.084 | 2.352 | 1.912 |
| VideoChat-R1.5-7B | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 1.723 | 0.026 | 3.034 | 3.086 |
| Qwen2.5VL-7B | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 2.512 | 0.075 | 2.452 | 2.090 |
| Qwen2.5VL-7B-Surg-CholecT50 (NVIDIA) | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 1.945 | 0.051 | 2.101 | 2.986 |
| Qwen2.5VL-7B-SFT (Ours) | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 3.665 | 0.165 | 3.596 | 2.757 |
| Qwen2.5VL-7B-MedGRPO (Ours) | 0.914 | 0.427 | 0.244 | 0.202 | 0.216 | 0.156 | 3.797 | 0.210 | 4.184 | 3.442 |
| Qwen3VL-4B | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 1.939 | 0.128 | 2.926 | 2.853 |
| Qwen3VL-4B-SFT (Ours) | 0.895 | 0.466 | 0.270 | 0.170 | 0.465 | 0.403 | 3.862 | 0.435 | 4.180 | 3.752 |
| Qwen3VL-4B-MedGRPO (Ours) | 0.898 | 0.473 | 0.285 | 0.212 | 0.504 | 0.441 | 3.950 | 0.491 | 4.227 | 3.861 |
■ Best ■ 2nd Best. Metrics: Accuracy (CVS/NAP/SA) · mIoU (STG/TAG) · LLM Judge (DVC/VS/RC) · F1 (DVC)
MedGRPO generates clinically accurate descriptions across diverse tasks
Compared with GPT-4.1, Gemini-2.5-Flash, and the SFT baseline, MedGRPO produces more precise instrument identification, more accurate spatial localization, and better clinical context understanding.
Ground Truth: "The grasper consistently grips and retracts the gallbladder towards the top left of the surgical field, providing counter-traction and exposure."
GPT-4.1: Generic descriptions without specific instruments.
Gemini-2.5-Flash: Misidentifies tool as "electrocautery hook" with incorrect actions.
SFT Baseline: Identifies "grasper" but uses vague spatial terms ("right side").
MedGRPO (Ours): "The grasper, positioned primarily on the upper left, steadily holds and maintains exposure of the surgical field, retracting the gallbladder to facilitate dissection."
Comparison of dense captioning quality across models, showing improved temporal coverage and procedural detail with MedGRPO.
MedGRPO generates more clinically coherent and terminology-precise summaries compared to off-the-shelf and SFT baselines.
Analysis of failure modes — understanding where MedGRPO still struggles informs future research directions.
@inproceedings{su2026medgrpo,
title={{MedGRPO}: Multi-Task Reinforcement Learning for
Heterogeneous Medical Video Understanding},
author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and
Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and
Shen, Yuhan and Innanje, Arun and Chen, Terrence and
Elhamifar, Ehsan and Wu, Ziyan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR)},
year={2026}
}