Abstract
We introduce InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, medical-focused Multimodal Large Language Models (MLLMs) developed by the InfiX-AI team. InfiMed-RL-3B achieves an average accuracy of 59.2% across seven authoritative medical benchmarks (including MMMU Health & Medicine, OmniMedVQA, and PMC-VQA), significantly outperforming all comparable-scale models such as MedGemma-4B-IT (54.8%) and even surpassing the larger InternVL3-8B (57.3%).
In recent years, multimodal large language models (MLLMs) have achieved remarkable progress in areas such as visual understanding, mathematical reasoning, and general-purpose dialogue, becoming a major direction in the development of artificial intelligence. By processing multiple modalities such as images and text simultaneously, they have opened up vast possibilities for advancing toward general artificial intelligence.
However, the situation in the medical domain is quite different. Compared with abundant general-domain internet data, medical data is scarce, often lacks complete reasoning chains, and involves high-stakes scenarios that demand explainable reasoning.

Left: comparison between sparse outputs and reflective, information-dense outputs. Right: enhanced exploration (bottom) enables broader and more effective search spaces, while limited exploration (top) leads to narrower, less efficient search spaces.
Furthermore, while general reinforcement learning methods (e.g., RLHF, RLVR) excel in broad tasks, their exploration within medical scenarios has been limited and often fails to consistently enhance model generalization.
To address these issues, we propose the InfiMed training paradigm, which achieves a breakthrough in understanding and reasoning for medical MLLMs under low-resource conditions. Our InfiMed-Series 3B models surpass similar-scale models such as Google's MedGemma-4B-IT and even the larger InternVL3-8B, establishing a new state of the art.

Training pipeline of the InfiMed-Series models.
Core Advantages
- Mixed Data & Multi-Dimensional Rewards: Integrating medical textual data, medical multimodal data, and general multimodal data in the SFT phase, and combining them with multi-dimensional rewards during the RLVR phase, enhances accuracy and generalization despite the limited availability of high-quality medical data (a minimal reward sketch follows the data-composition figure below).
- Early Infusion of Reflective CoT: Introducing Reflective Chain-of-Thought during fine-tuning equips the model with self-reflection and generates richer candidate answers for subsequent RLVR, while data quality is safeguarded through rejection sampling and cross-model verification (see the filtering sketch below) to ensure accuracy and reliability.
- Sample-Efficient Training: Trained with only 188K SFT and 36K RLVR samples, the InfiMed-Series models demonstrate strong understanding and reasoning capabilities on seven medical benchmarks. Notably, InfiMed-RL-3B achieves an average accuracy of 59.2% across these benchmarks, significantly outperforming the comparable-sized MedGemma-4B-IT (54.8%) and even surpassing the larger InternVL3-8B (57.3%).

Data composition of the InfiMed-Series models: the SFT phase utilizes 188K samples, followed by the RLVR phase with 36K samples.
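
To illustrate how multi-dimensional rewards in the RLVR phase can be combined, the Python sketch below mixes a verifiable accuracy term with a format term. The specific components, the `<think>/<answer>` answer template, and the 0.9/0.1 weights are assumptions for illustration, not the exact InfiMed reward configuration.

```python
import re

# Assumed response template: <think>...</think><answer>...</answer>
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed reasoning/answer template."""
    return 1.0 if THINK_RE.search(response) and ANSWER_RE.search(response) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the verifiable ground truth (e.g., an MCQ option)."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0


def multi_dimensional_reward(response: str, ground_truth: str,
                             w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """Weighted sum of the reward dimensions; the 0.9/0.1 weights are illustrative only."""
    return w_acc * accuracy_reward(response, ground_truth) + w_fmt * format_reward(response)


# Example: a well-formatted, correct multiple-choice response receives the full reward.
response = "<think>The chest X-ray shows no consolidation, so pneumonia is unlikely.</think><answer>B</answer>"
print(multi_dimensional_reward(response, "B"))  # 1.0
```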
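
Similarly, the rejection-sampling and cross-model verification step used to curate Reflective CoT data can be pictured as the minimal sketch below. Here `generate_cot`, `extract_answer`, and `judge_answer` are hypothetical callables standing in for the teacher MLLM, an answer parser, and an independent verifier model; the candidate budget of 8 is illustrative rather than the actual pipeline setting.

```python
from typing import Callable, Optional


def build_reflective_cot_sample(
    question: str,
    image_path: str,
    ground_truth: str,
    generate_cot: Callable[[str, str], str],         # teacher MLLM: (question, image) -> reflective CoT
    extract_answer: Callable[[str], Optional[str]],  # parses the final answer out of a CoT response
    judge_answer: Callable[[str, str, str], bool],   # verifier model: does the CoT support the answer?
    num_candidates: int = 8,
) -> Optional[str]:
    """Keep the first candidate that passes rejection sampling and cross-model verification."""
    for _ in range(num_candidates):
        cot = generate_cot(question, image_path)
        answer = extract_answer(cot)
        # Rejection sampling: discard candidates whose final answer is wrong or missing.
        if answer is None or answer.strip().lower() != ground_truth.strip().lower():
            continue
        # Cross-model verification: a second model must agree the reasoning supports the answer.
        if judge_answer(question, cot, ground_truth):
            return cot
    # No reliable candidate was found; the question is dropped from the SFT data.
    return None
```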
Quantitative Results
InfiMed-RL-3B, with an average score of 59.2, leads across all seven benchmarks; in the table below, MMMU-H&M, OMVQA, and MedXQA denote MMMU Health & Medicine, OmniMedVQA, and MedXpertQA-Multimodal, respectively.
It not only outperforms comparable-scale models such as Google’s MedGemma-4B-IT (54.8) but also surpasses larger models such as HuatuoGPT-V-7B (54.2) and InternVL3-8B (57.3), highlighting its resource efficiency and reasoning capability.
Model | Size | MMMU-H&M | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OMVQA | MedXQA | Avg. |
---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||
GPT-5 | - | 83.60 | 67.80 | 78.10 | 52.80 | 60.00 | 76.40 | 71.00 | 70.00 |
GPT-5-mini | - | 80.50 | 66.30 | 76.10 | 52.40 | 57.60 | 70.90 | 60.10 | 66.30 |
GPT-5-nano | - | 74.10 | 55.40 | 69.30 | 45.40 | 51.30 | 66.50 | 45.10 | 58.20 |
GPT-4.1 | - | 75.20 | 65.00 | 72.20 | 55.50 | 55.20 | 75.50 | 45.20 | 63.40 |
Claude Sonnet 4 | - | 74.60 | 67.60 | 70.60 | 54.20 | 54.40 | 65.50 | 43.30 | 61.50 |
Gemini-2.5-Flash | - | 76.90 | 68.50 | 75.80 | 55.40 | 55.40 | 71.00 | 52.80 | 65.10 |
General Open-source Models | |||||||||
Qwen2.5VL-3B | 3B | 51.30 | 56.80 | 63.20 | 37.10 | 50.60 | 64.50 | 20.70 | 49.20 |
Qwen2.5VL-7B | 7B | 54.00 | 64.96 | 67.62 | 44.60 | 51.25 | 63.47 | 21.70 | 52.51 |
InternVL2.5-8B | 8B | 53.50 | 59.40 | 69.00 | 42.10 | 51.30 | 81.30 | 21.70 | 54.00 |
InternVL3-8B | 8B | 59.20 | 65.40 | 72.80 | 48.60 | 53.80 | 79.10 | 22.40 | 57.30 |
Medical Open-source Models | |||||||||
MedGemma-4B-IT | 4B | 43.70 | 72.50 | 76.40 | 48.80 | 49.90 | 69.80 | 22.30 | 54.80 |
LLaVA-Med-7B | 7B | 29.30 | 53.70 | 48.00 | 38.80 | 30.50 | 44.30 | 20.30 | 37.80 |
HuatuoGPT-V-7B | 7B | 47.30 | 67.00 | 67.80 | 48.00 | 53.30 | 74.20 | 21.60 | 54.20 |
Lingshu-7B | 7B | 54.00 | 67.90 | 83.10 | 61.90 | 56.30 | 82.90 | 26.70 | 61.80 |
BioMediX2-8B | 8B | 39.80 | 49.20 | 57.70 | 37.00 | 43.50 | 63.30 | 21.80 | 44.60 |
InfiMed-Series Models | |||||||||
InfiMed-SFT-3B | 3B | 54.67 | 58.09 | 82.00 | 60.59 | 53.22 | 67.01 | 23.55 | 57.02 |
InfiMed-RL-3B | 3B | 55.33 | 60.53 | 82.38 | 61.97 | 58.74 | 71.71 | 23.60 | 59.18 |
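
The Avg. column appears to be the unweighted mean of the seven per-benchmark scores; a quick sanity check for the InfiMed-RL-3B row, under that assumption, is shown below.

```python
# Sanity check of the Avg. column for the InfiMed-RL-3B row, assuming it is the
# unweighted mean of the seven per-benchmark scores reported above.
scores = [55.33, 60.53, 82.38, 61.97, 58.74, 71.71, 23.60]
print(round(sum(scores) / len(scores), 2))  # 59.18, matching the table
```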
Qualitative Results
Comparative analysis reveals that InfiMed-RL-3B delivers the most direct and accurate answers. In contrast, InfiMed-SFT-3B sometimes requires lengthy analysis or “reflection” steps to reach the correct conclusion. This validates that the RLVR stage effectively enhances the model’s decision-making efficiency and final accuracy, making it more suitable for high-stakes scenarios such as clinical or diagnostic assistance.

A case study comparing Qwen2.5-VL-3B and the InfiMed model series on Medical Visual Question Answering (VQA). Content highlighted in red indicates errors or irrelevant information, while dark green signifies correct or critical content. Qwen2.5-VL-3B misunderstood the examination required by the patient, leading to an incorrect answer. Both InfiMed-SFT-3B and InfiMed-RL-3B provided the correct response, but InfiMed-RL-3B delivered a more direct, clear, and efficient answer.
Citation Information
If you find this work useful, please consider citing the following paper:
```bibtex
@misc{liu2025infimedlowresourcemedicalmllms,
  title={InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning},
  author={Zeyu Liu and Zhitian Hou and Guanghao Zhu and Zhijie Sang and Congkai Xie and Hongxia Yang},
  year={2025},
  eprint={2505.23867},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.23867},
}
```
Acknowledgements
We would like to express our gratitude for the following open-source projects: EasyR1, VERL, LLaMA-Factory, and Qwen2.5-VL.