InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Yuhang Liu et al.
InfiX.ai
Keywords: GUI Agent, Multimodal Large Language Model, Computer Vision, Automation

Abstract

InfiGUIAgent is a multimodal generalist GUI agent trained through two-stage supervised fine-tuning: the first stage builds fundamental GUI understanding skills, and the second instills advanced reasoning capabilities for native GUI interactions.

This is the repo for the paper “InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection”. In this work, we develop a GUI agent based on a multimodal large language model that enables enhanced task automation on computing devices. The agent is trained through a two-stage supervised fine-tuning approach: the first stage focuses on fundamental GUI understanding skills, and the second integrates hierarchical reasoning and expectation-reflection reasoning to give the agent native reasoning abilities in GUI interactions.
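
To make the two reasoning styles concrete, here is a minimal, hypothetical sketch of how they could fit into a single agent step. None of the names (`StepRecord`, `reflect`, `decide`) come from the InfiGUIAgent codebase; the sketch only illustrates the structure: hierarchical reasoning produces a plan and a concrete action, while expectation-reflection reasoning compares the previous step's predicted outcome against what was actually observed.

```python
# Hypothetical sketch of hierarchical + expectation-reflection reasoning.
# Not the paper's implementation; names and logic are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StepRecord:
    expectation: str  # what the agent predicted the action would achieve
    observation: str  # summary of the screen after the action
    reflection: str   # whether the expectation was met, and why

def reflect(history: List[StepRecord], observation: str) -> Optional[str]:
    """Expectation-reflection: compare the last step's expectation to reality."""
    if not history:
        return None
    met = history[-1].expectation in observation  # toy check, not the real method
    return f"expectation {'met' if met else 'not met'}: {history[-1].expectation!r}"

def decide(goal: str, observation: str, reflection: Optional[str]) -> tuple[str, str]:
    """Hierarchical reasoning: a high-level plan, then a concrete GUI action."""
    plan = f"break down {goal!r} into subtasks given the current screen"
    action = "tap('search box')"  # placeholder low-level GUI action
    expectation = "a keyboard and an empty search field appear"
    return action, expectation

# One illustrative agent step.
history: List[StepRecord] = []
observation = "home screen showing a search box"
reflection = reflect(history, observation)
action, expectation = decide("check today's weather", observation, reflection)
history.append(StepRecord(expectation, observation, reflection or "first step"))
print(action, "->", expectation)
```

In a real agent loop, the stub bodies of `reflect` and `decide` would be replaced by calls to the multimodal model, with the accumulated `StepRecord` history fed back into the prompt at each step.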

News

InfiGUIAgent

We are in the process of uploading key artifacts from our paper to our Hugging Face Collection.
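
Once the artifacts are published, they should be loadable with standard Hugging Face `transformers` tooling. A minimal sketch, assuming a placeholder model ID (check the collection for the actual repository name):

```python
# Hypothetical loading snippet; "InfiX-ai/InfiGUIAgent" is a placeholder ID,
# not a confirmed repository name.
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "InfiX-ai/InfiGUIAgent"  # placeholder
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)
```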

Regarding the full model release: because portions of our training data come from third-party sources with licensing restrictions, we are currently sanitizing the dataset and retraining and refining the final model to ensure full compliance while maintaining performance.

Stay tuned for updates!