InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Yuhang Liu et al.
InfiX.ai
Keywords: GUI Agent, Multimodal Large Language Model, Computer Vision, Automation

Abstract

InfiGUIAgent is a multimodal generalist GUI agent trained through two-stage supervised fine-tuning: the first stage builds fundamental GUI understanding skills, and the second instills advanced reasoning capabilities for native GUI interactions.

This is the repo for the paper “InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection”. In this work, we develop a GUI agent based on a multimodal large language model that enables enhanced task automation on computing devices. The agent is trained through a two-stage supervised fine-tuning approach: the first stage focuses on fundamental GUI understanding skills, and the second integrates hierarchical reasoning and expectation-reflection reasoning to give the agent native reasoning abilities in GUI interactions.
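
To make the two reasoning styles concrete, here is a minimal, hypothetical sketch of how they could fit into a single agent step. None of the names (`StepRecord`, `reflect`, `decide`) come from the InfiGUIAgent codebase; the sketch only illustrates the structure: hierarchical reasoning produces a plan and a concrete action, while expectation-reflection reasoning compares the previous step's predicted outcome against what was actually observed.

```python
# Hypothetical sketch of hierarchical + expectation-reflection reasoning.
# Not the paper's implementation; names and logic are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StepRecord:
    expectation: str  # what the agent predicted the action would achieve
    observation: str  # summary of the screen after the action
    reflection: str   # whether the expectation was met, and why

def reflect(history: List[StepRecord], observation: str) -> Optional[str]:
    """Expectation-reflection: compare the last step's expectation to reality."""
    if not history:
        return None
    met = history[-1].expectation in observation  # toy check, not the real method
    return f"expectation {'met' if met else 'not met'}: {history[-1].expectation!r}"

def decide(goal: str, observation: str, reflection: Optional[str]) -> tuple[str, str]:
    """Hierarchical reasoning: a high-level plan, then a concrete GUI action."""
    plan = f"break down {goal!r} into subtasks given the current screen"
    action = "tap('search box')"  # placeholder low-level GUI action
    expectation = "a keyboard and an empty search field appear"
    return action, expectation

# One illustrative agent step.
history: List[StepRecord] = []
observation = "home screen showing a search box"
reflection = reflect(history, observation)
action, expectation = decide("check today's weather", observation, reflection)
history.append(StepRecord(expectation, observation, reflection or "first step"))
print(action, "->", expectation)
```

In a real agent loop, the stub bodies of `reflect` and `decide` would be replaced by calls to the multimodal model, with the accumulated `StepRecord` history fed back into the prompt at each step.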

News

InfiGUIAgent

We are in the process of uploading key artifacts from our paper to our Hugging Face Collection.
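
Once the artifacts are published, they should be loadable with standard Hugging Face `transformers` tooling. A minimal sketch, assuming a placeholder model ID (check the collection for the actual repository name):

```python
# Hypothetical loading snippet; "InfiX-ai/InfiGUIAgent" is a placeholder ID,
# not a confirmed repository name.
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "InfiX-ai/InfiGUIAgent"  # placeholder
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)
```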

Regarding the full model release: because portions of our training data come from third-party sources with licensing restrictions, we are currently sanitizing the dataset and retraining and refining the final model to ensure full compliance while maintaining performance.

Stay tuned for updates!