
LASER: Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li
Soochow University

We propose LASER, a self-evolving framework that integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to construct high-quality preference data that jointly encourages accuracy and diversity. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps according to task complexity.

Method Overview

Overview: Given a user instruction and the original image, the trained LASER model progressively focuses on key regions through a multi-step reasoning process. At each step, the visual CoT captures critical cues (highlighted in red within the <think> tag) from the current focus region. Below, we also illustrate the multi-stage self-evolving optimization process that elicits LASER's multi-step active perception capabilities.
  1. Eliciting Active Perception through Visual Cropping. Given the paired training data, we prompt the VLM backbone Mraw to predict a focused region. The corresponding region is then cropped from the original image and integrated into the CoT as visual context, guiding the model toward accurate click-coordinate prediction. To improve the quality of reasoning trajectories, we adopt a STaR-style rejection sampling strategy to construct the dataset Dsft, which is used to fine-tune Msft.
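The rejection-sampling step above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: `model_raw.generate`, the trajectory fields, and the per-example sample budget are all hypothetical names chosen for clarity.

```python
# Hypothetical sketch of STaR-style rejection sampling for building Dsft.
# A sampled trajectory is kept only if its final click lands inside the
# ground-truth box; `model_raw.generate` is an assumed interface.

def build_sft_dataset(model_raw, examples, n_samples=8):
    """Collect one verified reasoning trajectory per training example."""
    d_sft = []
    for ex in examples:
        for _ in range(n_samples):
            traj = model_raw.generate(ex["instruction"], ex["image"])
            # traj holds a predicted focus region and a final click point
            x, y = traj["click"]
            x1, y1, x2, y2 = ex["gt_box"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                # rejection sampling: keep only correct trajectories
                d_sft.append({"input": ex, "trajectory": traj})
                break  # one accepted trajectory per example is enough
    return d_sft
```

Examples for which no sampled trajectory is correct are simply dropped, which is what makes this a filter rather than a reranker.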
  2. Learning Focused Region Preferences. We sample multiple reasoning trajectories from Msft and estimate region-wise preferences using Monte Carlo estimation. An IoU-based filter is applied to remove low-quality candidates. The resulting preference-pair dataset Ddpo is used to train a stronger model Mdpo via DPO.
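A minimal sketch of this preference-construction step, under stated assumptions: each candidate region is scored by the fraction of rollouts that succeed (the Monte Carlo estimate), and an IoU threshold against the ground-truth box filters weak candidates. The threshold value, `rollout_fn` interface, and best-vs-worst pairing rule are illustrative choices, not the paper's exact procedure.

```python
# Illustrative Monte Carlo region scoring with an IoU-based filter.

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def score_region(rollout_fn, region, n=16):
    """Monte Carlo estimate: fraction of rollouts from this region that succeed."""
    return sum(rollout_fn(region) for _ in range(n)) / n

def build_preference_pair(candidates, gt_box, rollout_fn, iou_min=0.2, n=16):
    """Filter by IoU with the ground truth, then pair best vs. worst region."""
    kept = [r for r in candidates if iou(r, gt_box) >= iou_min]
    scored = sorted(((score_region(rollout_fn, r, n), r) for r in kept),
                    reverse=True)
    if len(scored) < 2 or scored[0][0] <= scored[-1][0]:
        return None  # no strict preference to learn from
    return {"chosen": scored[0][1], "rejected": scored[-1][1]}
```

Pairs with no strict score gap carry no preference signal, so they are discarded before DPO training.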
  3. Difficulty-Aware Multi-step Perception. While Mdpo supports single-step perception, it is prone to failure in complex scenarios that demand deeper reasoning. To overcome this limitation, we allow Mdpo to iteratively generate multi-step reasoning trajectories, enabling the construction of a diverse and difficulty-aware training dataset. The final model is then trained on this multi-step dataset D, equipping it with the ability to dynamically adjust its reasoning depth based on the difficulty of the query.
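The iterative loop described in step 3 can be sketched as below. The `model.step` interface, the action schema, and the depth cap are assumptions made for illustration; the key idea is that the model itself decides whether to zoom further or to click, so easy queries terminate early.

```python
# Hedged sketch of difficulty-aware multi-step perception at inference time.

def crop(image, region):
    # Placeholder: in practice this slices pixels from the image tensor;
    # here the "image" is just a box tuple so cropping returns the region.
    return region

def multi_step_ground(model, instruction, image, max_steps=3):
    """Iteratively focus on key regions until the model commits to a click."""
    context, current = [], image
    for _ in range(max_steps):
        action = model.step(instruction, current, context)
        if action["type"] == "click":
            return action["point"]        # model judged the view detailed enough
        current = crop(current, action["region"])  # zoom into the key region
        context.append(action["region"])           # accumulate visual CoT context
    # Depth cap reached: force a final click on the current view.
    return model.step(instruction, current, context + ["force_click"])["point"]
```

Simple targets resolve in one step, while small or ambiguous targets consume more of the step budget, which is the "difficulty-aware" behavior the training data is built to elicit.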

Dataset

We release the synthetic training dataset used in our paper to facilitate further research in GUI grounding and preference optimization. The dataset is hosted on 🤗 Hugging Face.

Performance on ScreenSpot-v2 and ScreenSpot-Pro

As shown in Table 2 and Table 1, we conduct comprehensive comparisons on both ScreenSpot-v2 and ScreenSpot-Pro benchmarks. The evaluation covers six GUI domains and two task types (Text and Icon grounding). Our method, LASER, consistently outperforms previous models in terms of both overall grounding accuracy and generalization ability across different domains, demonstrating the effectiveness and robustness of our self-evolving training strategy.

More Cases

As shown below, we present some examples from the test samples. The annotated elements in the figures are as follows:

  • Red box: first crop area
  • Blue box: second crop area
  • Green box: ground truth boundary (correct if click falls inside)
  • Red point: model's predicted click coordinates
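The correctness rule in the legend above (a prediction counts as correct when the red click point falls inside the green ground-truth box) amounts to a simple containment check; the function name here is our own, not from the paper's code.

```python
# Containment check matching the annotation rule: a predicted click is
# correct iff it lies inside the ground-truth bounding box.

def is_correct(click, gt_box):
    x, y = click
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2
```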

BibTeX


    @misc{wang2025learningactiveperceptionselfevolving,
      title={Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding}, 
      author={Wanfu Wang and Qipeng Huang and Guangquan Xue and Xiaobo Liang and Juntao Li},
      year={2025},
      eprint={2509.04243},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.04243}, 
    }