WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
Abstract
A comprehensive benchmark study evaluates the detection of prompt injection attacks against web agents, finding that current detectors identify attacks with explicit textual instructions or visible image perturbations with moderate to high accuracy, but largely fail against attacks that omit explicit instructions or use imperceptible perturbations.
Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.
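The benchmark evaluates detectors on both malicious and benign samples, so the core measurement is how often a detector flags attack segments versus how often it falsely flags benign ones. The snippet below is a minimal sketch of that kind of evaluation loop; the file names, file layout, and the toy keyword detector are hypothetical placeholders and not the WAInjectBench API (see the linked repository for the actual datasets and code).

```python
# Minimal sketch: scoring a text-based prompt injection detector on
# malicious vs. benign text segments. Paths, file layout, and the toy
# detector are hypothetical placeholders, not the WAInjectBench API.
import json
from typing import Callable, List


def load_segments(path: str) -> List[str]:
    """Load one text segment per JSON line (hypothetical file layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f]


def keyword_detector(segment: str) -> bool:
    """Toy detector: flags segments containing explicit instruction phrases.
    Real detectors (fine-tuned classifiers, LLM judges, etc.) would replace this."""
    triggers = ("ignore previous instructions", "you must now", "instead, do")
    return any(t in segment.lower() for t in triggers)


def evaluate(detector: Callable[[str], bool],
             malicious: List[str], benign: List[str]) -> dict:
    """Detection rate on malicious samples, false positive rate on benign samples."""
    tpr = sum(detector(s) for s in malicious) / max(len(malicious), 1)
    fpr = sum(detector(s) for s in benign) / max(len(benign), 1)
    return {"detection_rate": tpr, "false_positive_rate": fpr}


if __name__ == "__main__":
    malicious = load_segments("malicious_text_segments.jsonl")  # placeholder path
    benign = load_segments("benign_text_segments.jsonl")        # placeholder path
    print(evaluate(keyword_detector, malicious, benign))
```

A detector of this form illustrates the paper's key finding: it can catch attacks that embed explicit instructions in text, but has no signal for attacks that omit such instructions or operate through image perturbations.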
Community
This paper presents WAInjectBench, a benchmark for evaluating prompt injection detection methods for web agents. It covers diverse attack strategies in both text and image modalities, and provides datasets along with a systematic evaluation of existing detectors. Code and data are available here: https://github.com/Norrrrrrr-lyn/WAInjectBench
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)
- SecInfer: Preventing Prompt Injection via Inference-time Scaling (2025)
- AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema (2025)
- Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts (2025)
- Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling (2025)
- A Whole New World: Creating a Parallel-Poisoned Web Only AI-Agents Can See (2025)
- Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization (2025)