Abstract
We introduce InfiGUI-G1, a multimodal GUI agent that employs Adaptive Exploration Policy Optimization (AEPO) to improve semantic alignment in GUI grounding, achieving up to 8.3% relative improvement over baseline methods.
Following InfiGUI-R1, we observe that Reinforcement Learning with Verifiable Rewards (RLVR) improves data efficiency by optimizing sequential coordinate generation, thereby enhancing spatial alignment. However, existing RLVR methods suffer from a critical yet overlooked limitation: inefficient exploration. Because they sample actions solely from the model's current policy, these methods tend to over-exploit high-confidence errors, a phenomenon we term the confidence trap. Once trapped, the model repeatedly chooses an incorrect action with high confidence and fails to explore alternative actions that, although low-probability, may be correct. This bottlenecks semantic alignment.

Why Previous RLVR Methods Fail
In this work, we introduce Adaptive Exploration Policy Optimization (AEPO), a novel policy optimization framework for multimodal large language models that addresses the semantic alignment challenge in GUI grounding. Using multi-answer generation and a theoretically grounded Adaptive Exploration Reward, InfiGUI-G1-3B and InfiGUI-G1-7B achieve state-of-the-art performance with up to an 8.3% relative improvement over baseline methods.

Adaptive Exploration Policy Optimization
AEPO improves policy learning through three complementary components:
- Multi-Answer Generation – broadens search space to escape high-confidence traps.
- Adaptive Exploration Reward (AER) – dynamically balances exploration benefits and costs.
- Collinear Penalty – ensures spatial diversity in exploration.
Instead of predicting a single action, AEPO generates N candidate points in one forward pass. This increases the chance of sampling low-probability but correct actions, especially for semantically challenging cases.
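To make this concrete, here is a minimal sketch (not the official implementation) of how the candidate points produced in one generation could be parsed and checked against the ground-truth element. The `(x, y)` response format and the helper names are illustrative assumptions.

```python
# Sketch only: parse N candidate points from one model response and find the
# first one that lands inside the ground-truth element box.
import re
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def parse_candidates(response: str) -> List[Point]:
    """Extract every '(x, y)' coordinate pair proposed in a single generation."""
    pairs = re.findall(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", response)
    return [(float(x), float(y)) for x, y in pairs]

def first_hit_rank(points: List[Point], gt_box: Box) -> int:
    """Return the 1-based rank of the first candidate inside the box, or 0 if none hit."""
    x1, y1, x2, y2 = gt_box
    for rank, (x, y) in enumerate(points, start=1):
        if x1 <= x <= x2 and y1 <= y <= y2:
            return rank
    return 0
```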
We define exploration efficiency as η = U / C, where the utility U reflects exploration success (+1 if any candidate hits the ground truth, −1 otherwise) and the cost C accounts for both the number of proposals and the verification effort via their geometric mean.
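The sketch below computes such a reward under the stated definition. The specific cost terms (verification effort taken as the rank of the first correct candidate, and a failed attempt costing all N verifications) are assumptions made for illustration, not the paper's exact formulation.

```python
import math

def adaptive_exploration_reward(num_candidates: int, hit_rank: int) -> float:
    """
    Illustrative Adaptive Exploration Reward, eta = U / C.
    Assumptions: U is +1 if any candidate is correct, -1 otherwise;
    C is the geometric mean of the number of proposals N and the verification
    effort k (rank of the first correct candidate, or N when all fail).
    """
    n = max(num_candidates, 1)
    if hit_rank > 0:          # success: verification can stop at the first hit
        u, k = 1.0, hit_rank
    else:                     # failure: all N candidates were verified in vain
        u, k = -1.0, n
    c = math.sqrt(n * k)      # geometric-mean cost
    return u / c
```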
If generated points are nearly collinear—indicating a trivial scanning strategy—the accuracy reward is overridden with a strong penalty. This encourages spatially diverse exploration patterns, improving the effectiveness of multi-answer search.
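One way to implement such a check is to measure how much of the candidates' spread lies off their principal axis. The sketch below does this with a singular-value ratio; the tolerance and the penalty value are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def is_nearly_collinear(points, tol: float = 1e-2) -> bool:
    """
    Treat the points as nearly collinear when the smaller singular value of the
    centered coordinate matrix is tiny relative to the larger one, i.e. almost
    all variance lies along a single line.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return False
    centered = pts - pts.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    return s[1] <= tol * s[0]

def reward_with_collinear_penalty(points, base_reward: float, penalty: float = -1.0) -> float:
    """Override the accuracy reward with a strong penalty for trivial line scans."""
    return penalty if is_nearly_collinear(points) else base_reward
```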
Our InfiGUI-G1 models, trained with the AEPO framework, establish new state-of-the-art results among open-source models across a diverse and challenging set of GUI grounding benchmarks:
Benchmark Evaluation of InfiGUI-G1-7B (grounding accuracy, %)
Model | MMBench-GUI | ScreenSpot-v2 | UI-Vision | I2E-Bench | ScreenSpot-Pro |
---|---|---|---|---|---|
Qwen2.5-VL-7B | 33.9 | 88.8 | 0.9 | 53.8 | - |
GUI-G²-7B | - | 93.3 | - | - | 47.5 |
UI-TARS-7B | - | 91.6 | 17.6 | 61.4 | 35.7 |
UGround-v1-7B | 65.7 | - | 12.9 | 70.3 | - |
UI-TARS-1.5-7B | 64.3 | - | - | 73.2 | 49.6 |
Qwen2.5-VL-72B | 41.8 | - | - | 51.4 | - |
UGround-v1-72B | - | - | 23.2 | 76.3 | - |
UI-TARS-72B | 74.3 | 90.3 | 25.5 | 73.7 | - |
Ours | | | | | |
InfiGUI-G1-7B | 80.8 | 93.5 | 26.1 | 77.4 | 51.9 |
w/ Expl. Success | 86.4 | 95.6 | 34.4 | 83.0 | 58.0 |
Benchmark Evaluation of InfiGUI-G1-3B (grounding accuracy, %)
Model | MMBench-GUI | ScreenSpot-v2 | UI-Vision | I2E-Bench | ScreenSpot-Pro |
---|---|---|---|---|---|
Qwen2.5-VL-3B | - | 80.9 | - | 41.7 | - |
UI-R1-E-3B | - | - | - | 69.1 | 33.5 |
Aguvis-7B | 45.7 | - | 13.7 | 53.2 | - |
OS-Atlas-7B | 41.4 | 85.1 | 9.0 | 58.6 | - |
Ours | | | | | |
InfiGUI-G1-3B | 73.4 | 91.1 | 22.0 | 72.6 | 45.2 |
w/ Expl. Success | 81.6 | 94.4 | 29.7 | 82.8 | 52.0 |
Citation Information
If you find this work useful, please consider citing the following papers:
@misc{liu2025infiguig1advancingguigrounding,
title={InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization},
author={Yuhang Liu and Zeyu Liu and Shuanghe Zhu and Pengxiang Li and Congkai Xie and Jiasheng Wang and Xueyu Hu and Xiaotian Han and Jianbo Yuan and Xinyao Wang and Shengyu Zhang and Hongxia Yang and Fei Wu},
year={2025},
eprint={2508.05731},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.05731},
}
@article{liu2025infigui,
title={InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners},
author={Liu, Yuhang and Li, Pengxiang and Xie, Congkai and Hu, Xavier and Han, Xiaotian and Zhang, Shengyu and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2504.14239},
year={2025}
}
@article{liu2025infiguiagent,
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
journal={arXiv preprint arXiv:2501.04575},
year={2025}
}
Acknowledgements
We would like to express our gratitude to the following open-source projects: EasyR1, VERL, LLaMA-Factory, and Qwen2.5-VL.