added README.md
README.md
CHANGED
|
@@ -1,10 +1,38 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: streamlit
|
| 7 |
sdk_version: 1.44.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 1 |
---
|
| 2 |
+
title: Adversarial Policy Probe
|
| 3 |
+
emoji: 🔒
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: streamlit
|
| 7 |
sdk_version: 1.44.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
+
|
| 12 |
+
# Adversarial Policy Probe
|
| 13 |
+
|
| 14 |
+
A Streamlit application that probes the policy-violation defenses of large language models by generating corrupted user prompts and classifying the model's replies for disallowed content.
|
| 15 |
+
|
| 16 |
+
> **Credit:** This tool is based on the “Best-of-N Jailbreak” methodology introduced in the Anthropic-led paper *Best-of-N Jailbreaking* by Hughes *et al.* (2024).[^1] [arXiv:2412.03556](https://arxiv.org/abs/2412.03556)
|
| 17 |
+
|
| 18 |
+
## 🚀 Features
|
| 19 |
+
|
| 20 |
+
- **Prompt-corruption helpers**
|
| 21 |
+
- Word scrambling (`apply_word_scrambling`)
|
| 22 |
+
- Random capitalization (`apply_random_caps`)
|
| 23 |
+
- ASCII-level noise injection (`apply_ascii_noise`)
|
| 24 |
+
- **One-token classifier prompt**
|
| 25 |
+
- Strict **YES/NO** output for policy-violation detection
|
| 26 |
+
- **Model loader with quantization & MPS support**
|
| 27 |
+
- 8-bit / 4-bit quantization via `BitsAndBytesConfig`
|
| 28 |
+
- Automatic device mapping on CUDA; falls back to full precision on MPS/CPU
|
| 29 |
+
- **Adversarial attack loop**
|
| 30 |
+
- Batch-driven corruption of the seed prompt
|
| 31 |
+
- Generates & classifies replies, tracks successful policy violations
|
| 32 |
+
- **Streamlit UI**
|
| 33 |
+
- Interactive sidebar controls: model, device, quantization, σ, iterations, batch size, seed
|
| 34 |
+
- Real-time progress bar & status updates
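
The prompt-corruption helpers listed above can be illustrated with a minimal sketch. This is not the code in `app.py`: the function names (`scramble_words`, `random_caps`, `ascii_noise`, `corrupt`), their signatures, and the way σ maps to per-step probabilities are all assumptions made for illustration. The real `apply_word_scrambling`, `apply_random_caps`, and `apply_ascii_noise` may behave differently.

```python
import random


def scramble_words(text: str, p: float, rng: random.Random) -> str:
    """Shuffle the interior letters of words longer than 3 characters, each with probability p."""
    words = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < p:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)


def random_caps(text: str, p: float, rng: random.Random) -> str:
    """Flip the case of each alphabetic character with probability p."""
    return "".join(
        ch.swapcase() if ch.isalpha() and rng.random() < p else ch for ch in text
    )


def ascii_noise(text: str, p: float, rng: random.Random) -> str:
    """Shift printable ASCII characters by one code point (up or down) with probability p."""
    out = []
    for ch in text:
        code = ord(ch)
        if 32 <= code <= 126 and rng.random() < p:
            shifted = code + rng.choice((-1, 1))
            # Keep the original character if the shift would leave the printable range.
            if 32 <= shifted <= 126:
                ch = chr(shifted)
        out.append(ch)
    return "".join(out)


def corrupt(prompt: str, sigma: float, seed: int) -> str:
    """Apply all three corruptions in sequence (hypothetical composition).

    Assumption: scrambling and capitalization use sigma directly, while the
    noise step uses sigma ** 3 so that it stays comparatively rare.
    """
    rng = random.Random(seed)
    text = scramble_words(prompt, sigma, rng)
    text = random_caps(text, sigma, rng)
    return ascii_noise(text, sigma ** 3, rng)
```

Passing an explicit `random.Random(seed)` keeps each corrupted variant reproducible, which matches the idea of exposing a seed control in the sidebar. For example, `corrupt("please summarize this document", sigma=0.4, seed=0)` returns one perturbed variant of the same length as the input, and calling it again with the same seed returns the same string.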
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
[^1]: Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., & Sharma, M. (2024). *Best-of-N Jailbreaking*. arXiv:2412.03556.
|