RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
Abstract
RainbowPlus, an evolutionary computation-based red-teaming framework, enhances adversarial prompt generation for LLMs, improving attack success rate and diversity compared to existing methods.
Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score ≈ 0.84), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
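To make the "multi-element archive" idea concrete, here is a minimal sketch of an archive in which each descriptor cell can hold several prompts instead of a single elite. It is not the RainbowPlus implementation: the class name `MultiElementArchive`, the threshold-based acceptance rule, and the placeholder `toy_fitness` function are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical descriptor: a (risk category, attack style) pair.
Cell = Tuple[str, str]


class MultiElementArchive:
    """Toy archive where every cell keeps a list of (score, prompt) entries."""

    def __init__(self, threshold: float) -> None:
        # Assumed acceptance rule: keep any candidate scoring at or above this.
        self.threshold = threshold
        self.cells: Dict[Cell, List[Tuple[float, str]]] = defaultdict(list)

    def add_batch(self, cell: Cell, candidates: List[str],
                  fitness: Callable[[str], float]) -> int:
        """Score all candidates at once and store every new one that passes."""
        existing = {prompt for _, prompt in self.cells[cell]}
        kept = 0
        for prompt in candidates:
            score = fitness(prompt)
            if score >= self.threshold and prompt not in existing:
                self.cells[cell].append((score, prompt))
                existing.add(prompt)
                kept += 1
        return kept

    def size(self) -> int:
        """Total number of stored prompts across all cells."""
        return sum(len(entries) for entries in self.cells.values())


def toy_fitness(prompt: str) -> float:
    # Placeholder score in [0, 1]; a real setup would query a judge model.
    return min(len(prompt) / 20.0, 1.0)


archive = MultiElementArchive(threshold=0.5)
cell = ("category_a", "style_x")
batch = ["short", "a longer candidate", "another long candidate"]
kept_first = archive.add_batch(cell, batch, toy_fitness)
kept_again = archive.add_batch(cell, batch, toy_fitness)  # duplicates are skipped
```

In a single-elite archive the second and third candidates would compete for one slot; here both are retained, which is one way an archive can grow far beyond the number of cells.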
Community
Very happy to share our work with the community!
Nice work!
I observed that the number of mutations varies between the two experimental setups. Could you explain how to determine the optimal number of mutations?
Hi @thucdangvan020999,
Thanks for your good question!
Increasing the number of mutations can boost the attack success rate. However, to ensure a fair comparison with other methods and manage costs, we limited it to 10 in Experiment 2. For practical applications, we strongly suggest raising the number to achieve better outcomes.
Best regards
I have a few questions:
Mutation Count Rationale:
In your method, you apply 20 mutations per prompt. Could you explain the rationale behind choosing 20 specifically?
ASR Evaluation via HarmBench:
For calculating Attack Success Rate (ASR), did you use only HarmBench's dataset, or did you also rely on their classifiers?
Iteration Limit Justification:
You mention running the evolutionary process for only 400 iterations during your HarmBench comparison. Could you explain what prompted the choice of 400?
Rainbow Teaming Reimplementation:
Since Rainbow Teaming is not open-source, did you reimplement it from scratch? If so, did you follow the exact setup described in the paper? Were there any assumptions or approximations you had to make in the absence of released code?
Hi @nj724, here are our responses to your questions. Please feel free to ask if you have more.
- Mutation Count Rationale: More mutations typically lead to a larger candidate set and a higher chance of selecting effective adversarial prompts. In the "Comparison to Rainbow Method" section, we use 20 mutations to adequately demonstrate the performance advantage of RAINBOWPLUS over Rainbow Teaming.
- ASR Evaluation via HarmBench: We do not use their classifier; we use Llama-Guard 8B to classify responses.
- Iteration Limit Justification: Unlike Rainbow Teaming, we do not replace Risk Categories with HarmBench prompts. As detailed in the "Experiment Setup" of the "Comparison to State-of-the-Art Methods" section, each prompt in the HarmBench experiment is generated through a single iteration, using a specific combination of Risk Category and Attack Style. This setup does not leverage historical context, which is one of RAINBOWPLUS's core strengths, and may slightly limit its performance. However, it ensures that the generated prompts remain aligned with the harmful behaviors targeted by the original HarmBench dataset. For this reason, we set the number of iterations equal to the number of samples (400) in HarmBench.
- Rainbow Teaming Reimplementation: We implemented the Rainbow Teaming algorithm strictly according to the pseudocode provided in Algorithm 3, as introduced in (Samvelyan et al. 2024).
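As a rough illustration of the ASR answer above, the sketch below computes an attack success rate as the fraction of target-model responses that a safety judge flags as unsafe. This is one common reading of ASR, not necessarily the exact definition used in the paper. The function `attack_success_rate` and the `stub_judge` callable are hypothetical; in practice the judge would wrap a classifier model such as the one mentioned in the reply.

```python
from typing import Callable, Iterable


def attack_success_rate(responses: Iterable[str],
                        is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of responses the judge flags as unsafe (0.0 if there are none)."""
    items = list(responses)
    if not items:
        return 0.0
    flagged = sum(1 for response in items if is_unsafe(response))
    return flagged / len(items)


def stub_judge(response: str) -> bool:
    # Stand-in for a real safety classifier; here a simple prefix check.
    return response.startswith("UNSAFE")


sample_responses = [
    "UNSAFE: placeholder text",
    "I can't help with that.",
    "UNSAFE: placeholder text",
    "Sorry, no.",
]
asr = attack_success_rate(sample_responses, stub_judge)
```

Swapping in a different judge changes only the `is_unsafe` callable, which is why ASR figures obtained with different classifiers are not directly comparable.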
Thanks for your reply, but I am still a bit unsure about the HarmBench comparison. When you say that each prompt in the HarmBench experiment is generated through a single iteration, using a specific combination of Risk Category and Attack Style, did you use 400 seed prompts for the run?