Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System
Abstract
A framework is developed to enhance security in LLM-based autonomous agents through detection and countermeasure strategies for advanced attacks like many-shot jailbreaking and deceptive alignment.
Autonomous AI agents built on large language models (LLMs) can create substantial value across society, but they face security threats from adversaries, and the resulting trust and safety issues warrant immediate protective solutions. Many-shot jailbreaking and deceptive alignment are among the main advanced attacks that cannot be mitigated by the static guardrails applied during supervised training, which makes them a crucial research priority for real-world robustness. Static guardrails deployed in a dynamic multi-agent system fail to defend against those attacks. We aim to enhance security for LLM-based agents by developing new evaluation frameworks that identify and counter threats for safe operational deployment. Our work uses three examination methods: detecting rogue agents through a Reverse Turing Test, analyzing deceptive alignment through multi-agent simulations, and developing an anti-jailbreaking system, which we test with Gemini 1.5 Pro, Llama-3.3-70B, and DeepSeek R1 models using tool-mediated adversarial scenarios. Detection capabilities are strong, for example 94% accuracy for Gemini 1.5 Pro, yet the system remains persistently vulnerable under long attacks: attack success rates (ASR) rise as prompt length increases, diversity metrics become ineffective for prediction, and multiple complex system faults are revealed. The findings demonstrate the need for flexible security systems based on active monitoring, which the agents themselves can perform, together with adaptable interventions by the system administrator, because current models can introduce vulnerabilities that make the overall system unreliable. In this work, we address such situations and propose a comprehensive framework to counteract these security issues.
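To make the attack success rate (ASR) mentioned above concrete, here is a minimal sketch of how ASR could be tabulated against the number of in-context shots. It is not the paper's evaluation code: the function name `attack_success_rate`, the trial format, and the sample data are all hypothetical and only illustrate the metric as successful attacks divided by total attempts.

```python
from collections import defaultdict


def attack_success_rate(trials):
    """Return {shot_count: ASR}, where ASR = successful attacks / total attempts.

    `trials` is an iterable of (shot_count, succeeded) pairs. This trial
    format is an assumption made for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for shots, succeeded in trials:
        totals[shots] += 1
        if succeeded:
            successes[shots] += 1
    # Sort by shot count so the trend over prompt length is easy to read.
    return {shots: successes[shots] / totals[shots] for shots in sorted(totals)}


# Made-up trial log for illustration only; these are NOT results from the paper.
trials = [
    (4, False), (4, False),
    (64, False), (64, True),
    (256, True), (256, True),
]
asr_by_shots = attack_success_rate(trials)
print(asr_by_shots)
```

Grouping by shot count rather than reporting one aggregate number is what lets a trend such as "ASR rises with prompt length" show up in the output.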
Community
What if agentic systems could defend themselves from adversarial attacks?
In our paper, we experimented with and developed a defensive system for AI agents to protect themselves against adversarial prompt attacks and jailbreak attempts. We demonstrated that our multi-agentic systems can:
- Detect and prevent prompt injection attacks
- Maintain operational boundaries
- Filter out malicious instructions
- Preserve their core directives
- Self-validate responses
This will have a significant impact in the near term, especially as agentic systems are required to function autonomously without human supervision.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Peering Behind the Shield: Guardrail Identification in Large Language Models (2025)
- Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach (2025)
- The TIP of the Iceberg: Revealing a Hidden Class of Task-In-Prompt Adversarial Attacks on LLMs (2025)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense (2025)
- Foot-In-The-Door: A Multi-turn Jailbreak for LLMs (2025)
- Red-Teaming LLM Multi-Agent Systems via Communication Attacks (2025)
- Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks (2025)
 Saikat Barua