Pixel-Searcher

From Web to Pixels: Bringing Agentic Search into Visual Perception

WebEyes Benchmark · Search-Based Grounding · Segmentation · VQA

What is Pixel-Searcher?

We introduce WebEyes, a benchmark for search-based visual reasoning where the target object cannot be reliably resolved from the image alone. Models must connect visual evidence with external knowledge, then return grounded outputs for object localization, pixel-level segmentation, or target-aware multiple-choice VQA.
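For concreteness, here is a minimal sketch of what a single benchmark instance could look like. The field names are our own illustration, not the released WebEyes schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WebEyesInstance:
    """Hypothetical per-instance schema; field names are illustrative,
    not the official WebEyes release format."""
    image_path: str                      # source image shared across task views
    question: str                        # knowledge-intensive query about the target
    task: str                            # "grounding" | "segmentation" | "vqa"
    bbox: Optional[List[float]] = None   # [x1, y1, x2, y2] for grounding
    mask_path: Optional[str] = None      # pixel-level mask for segmentation
    choices: Optional[List[str]] = None  # options for multiple-choice VQA
    answer: Optional[str] = None         # correct choice for VQA
```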

Pixel-Searcher is our reference method for this setting. It combines agentic web search, visual target disambiguation, and task-specific prediction formatting, enabling reproducible evaluation across boxes, masks, and answer choices.

Pixel-Searcher Overview

Pixel-Searcher bridges web-scale knowledge with pixel-level visual perception through agentic multi-round search and reasoning.

Figure 1. Comparison between visual-cue segmentation, reasoning segmentation, and our Pixel-Searcher approach that augments perception with agentic web search.
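To make the loop concrete, below is a minimal sketch of a multi-round search-and-ground agent under our own assumptions; `vlm_reason`, `web_search`, and `predict_mask` are hypothetical stand-ins, not Pixel-Searcher's actual interfaces.

```python
from typing import Any, Callable, Dict, List

def pixel_search(
    image: Any,
    question: str,
    vlm_reason: Callable[..., Dict],          # hypothetical multimodal reasoner
    web_search: Callable[[str], str],         # hypothetical search tool
    predict_mask: Callable[[Any, str], Any],  # hypothetical grounding head
    max_rounds: int = 5,
) -> Any:
    """Sketch of a multi-round search-and-ground loop; all three
    callables are illustrative stand-ins, not the paper's interfaces."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        # Reason over the image, the question, and the evidence gathered so
        # far, then either issue another web query or commit to a target.
        step = vlm_reason(image=image, question=question, evidence=evidence)
        if step["action"] == "search":
            # Gather external knowledge the image alone cannot provide.
            evidence.append(web_search(step["query"]))
        else:
            # Target disambiguated: produce the task-specific output.
            return predict_mask(image, step["target_description"])
    # Budget exhausted: ground the best current hypothesis.
    return predict_mask(image, question)
```

For Example 1 below, such a loop would first query the web for the NE:AR 2025 ambassador and only then ground the identified person in the image.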

WebEyes Benchmark

A knowledge-intensive visual reasoning benchmark spanning 6 categories with grounding, segmentation, and VQA tasks.

Figure 2. WebEyes contains three task views: Search-based Segmentation, Search-based Grounding, and Search-based VQA.
Statistic               Count
Source images           120
Object annotations      473
Grounding rows          645
Segmentation rows       645
VQA rows                637
Total task instances    1,927
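The per-task rows sum to the total, which a quick check confirms:

```python
# Per-task instance counts from the table above.
counts = {"grounding": 645, "segmentation": 645, "vqa": 637}
assert sum(counts.values()) == 1_927  # matches "Total task instances"
```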

Category Distribution

(Chart: distribution of WebEyes instances across the six categories.)

Performance Comparison

Comprehensive evaluation across all three WebEyes tasks, reported both as overall performance and as a category-wise analysis.
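The paper defines the exact evaluation protocol; as a rough guide, the following sketch shows the standard metrics for these three task types (box IoU for grounding, mask IoU for segmentation, exact-match accuracy for VQA), under our own assumptions about input formats.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / (union + 1e-9)

def vqa_accuracy(preds, golds):
    """Exact-match accuracy over multiple-choice answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```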

Example 1: Brand Ambassador Identification
Question: "Please find 'the person who became a brand ambassador for the South Korean brand NE:AR in 2025' in the image."
Answering requires external knowledge (identifying the ambassador of the named brand in the given year) before the visible target (Sana) can be localized and segmented.
(Images: the original image alongside the predicted mask.)

BibTeX

@misc{yang2026webpixelsbringingagentic,
      title={From Web to Pixels: Bringing Agentic Search into Visual Perception}, 
      author={Bokang Yang and Xinyi Sun and Kaituo Feng and Xingping Dong and Dongming Wu and Xiangyu Yue},
      year={2026},
      eprint={2605.12497},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.12497}, 
}