Pixel-Searcher

From Web to Pixels: Bringing Agentic Search into Visual Perception

WebEyes Benchmark · Search-Based Grounding · Segmentation · VQA

What is Pixel-Searcher?

We introduce WebEyes, a benchmark for search-based visual reasoning where the target object cannot be reliably resolved from the image alone. Models must connect visual evidence with external knowledge, then return grounded outputs for object localization, pixel-level segmentation, or target-aware multiple-choice VQA.
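For concreteness, here is a minimal sketch of what a single benchmark instance could look like. The field names are our own illustration, not the released WebEyes schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WebEyesInstance:
    """Hypothetical per-instance schema; field names are illustrative,
    not the official WebEyes release format."""
    image_path: str                      # source image shared across task views
    question: str                        # knowledge-intensive query about the target
    task: str                            # "grounding" | "segmentation" | "vqa"
    bbox: Optional[List[float]] = None   # [x1, y1, x2, y2] for grounding
    mask_path: Optional[str] = None      # pixel-level mask for segmentation
    choices: Optional[List[str]] = None  # options for multiple-choice VQA
    answer: Optional[str] = None         # correct choice for VQA
```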

Pixel-Searcher is our reference method for this setting. It combines agentic web search, visual target disambiguation, and task-specific prediction formatting, enabling reproducible evaluation across boxes, masks, and answer choices.

Pixel-Searcher Overview

Pixel-Searcher bridges web-scale knowledge with pixel-level visual perception through agentic multi-round search and reasoning.

Figure 1. Comparison between visual-cue segmentation, reasoning segmentation, and our Pixel-Searcher approach that augments perception with agentic web search.
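To make the loop concrete, below is a minimal sketch of a multi-round search-and-ground agent under our own assumptions; `vlm_reason`, `web_search`, and `predict_mask` are hypothetical stand-ins, not Pixel-Searcher's actual interfaces.

```python
from typing import Any, Callable, Dict, List

def pixel_search(
    image: Any,
    question: str,
    vlm_reason: Callable[..., Dict],          # hypothetical multimodal reasoner
    web_search: Callable[[str], str],         # hypothetical search tool
    predict_mask: Callable[[Any, str], Any],  # hypothetical grounding head
    max_rounds: int = 5,
) -> Any:
    """Sketch of a multi-round search-and-ground loop; all three
    callables are illustrative stand-ins, not the paper's interfaces."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        # Reason over the image, the question, and the evidence gathered so
        # far, then either issue another web query or commit to a target.
        step = vlm_reason(image=image, question=question, evidence=evidence)
        if step["action"] == "search":
            # Gather external knowledge the image alone cannot provide.
            evidence.append(web_search(step["query"]))
        else:
            # Target disambiguated: produce the task-specific output.
            return predict_mask(image, step["target_description"])
    # Budget exhausted: ground the best current hypothesis.
    return predict_mask(image, question)
```

For Example 1 below, such a loop would first query the web for the NE:AR 2025 ambassador and only then ground the identified person in the image.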

WebEyes Benchmark

A knowledge-intensive visual reasoning benchmark spanning 6 categories with grounding, segmentation, and VQA tasks.

Figure 2. WebEyes contains three task views: Search-based Segmentation, Search-based Grounding, and Search-based VQA.
Statistic               Count
Source images           120
Object annotations      473
Grounding rows          645
Segmentation rows       645
VQA rows                637
Total task instances    1,927
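The per-task rows sum to the total, which a quick check confirms:

```python
# Per-task instance counts from the table above.
counts = {"grounding": 645, "segmentation": 645, "vqa": 637}
assert sum(counts.values()) == 1_927  # matches "Total task instances"
```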

Category Distribution

(Chart: distribution of WebEyes instances across the six categories.)

Performance Comparison

Comprehensive evaluation across all three WebEyes tasks, reported both as overall performance and as a category-wise analysis.
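The paper defines the exact evaluation protocol; as a rough guide, the following sketch shows the standard metrics for these three task types (box IoU for grounding, mask IoU for segmentation, exact-match accuracy for VQA), under our own assumptions about input formats.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / (union + 1e-9)

def vqa_accuracy(preds, golds):
    """Exact-match accuracy over multiple-choice answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```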

Example 1: Brand Ambassador Identification
Question: "Please find 'the person who became a brand ambassador for the South Korean brand NE:AR in 2025' in the image."
Answering requires external knowledge (identifying the ambassador of the named brand in the given year) before the visible target (Sana) can be localized and segmented.
(Images: the original image alongside the predicted mask.)

BibTeX

@misc{yang2026webpixelsbringingagentic,
      title={From Web to Pixels: Bringing Agentic Search into Visual Perception}, 
      author={Bokang Yang and Xinyi Sun and Kaituo Feng and Xingping Dong and Dongming Wu and Xiangyu Yue},
      year={2026},
      eprint={2605.12497},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.12497}, 
}