MottaCC committed on
Commit
3ecc4a4
·
1 Parent(s): 815bb33

added README.md

  1. README.md +32 -4
README.md CHANGED
@@ -1,10 +1,38 @@
1
  ---
2
- title: Mistral 7B LoRA DMT
3
- emoji: 🌍
4
- colorFrom: yellow
5
- colorTo: red
6
  sdk: streamlit
7
  sdk_version: 1.44.1
8
  app_file: app.py
9
  pinned: false
10
  ---
 
1
  ---
2
+ title: Adversarial Policy Probe
3
+ emoji: 🔒
4
+ colorFrom: indigo
5
+ colorTo: purple
6
  sdk: streamlit
7
  sdk_version: 1.44.1
8
  app_file: app.py
9
  pinned: false
10
  ---
11
+
12
+ # Adversarial Policy Probe
13
+
14
+ A Streamlit application designed to probe large-language-model policy-violation defenses by generating corrupted user prompts and classifying model replies for disallowed content.
15
+
16
+ > **Credit:** This tool is based on the “Best-of-N Jailbreak” methodology introduced in the Anthropic-led paper *Best-of-N Jailbreaking* by Hughes *et al.* (2024).[^1] ([arXiv:2412.03556](https://arxiv.org/abs/2412.03556))
17
+
18
+ ## 🚀 Features
19
+
20
+ - **Prompt-corruption helpers**
21
+   - Word scrambling (`apply_word_scrambling`)
22
+   - Random capitalization (`apply_random_caps`)
23
+   - ASCII-level noise injection (`apply_ascii_noise`)
24
+ - **One-token classifier prompt**
25
+   - Strict **YES/NO** output for policy-violation detection
26
+ - **Model loader with quantization & MPS support**
27
+   - 8-bit / 4-bit quantization via `BitsAndBytesConfig`
28
+   - Automatic device mapping on CUDA, fallback to full precision on MPS/CPU
29
+ - **Adversarial attack loop**
30
+   - Batch-driven corruption of the seed prompt
31
+   - Generates & classifies replies, tracks successful policy violations
32
+ - **Streamlit UI**
33
+   - Interactive sidebar controls: model, device, quantization, σ, iterations, batch size, seed
34
+   - Real-time progress bar & status updates
35
+
36
+ ---
37
+
38
+ [^1]: Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., & Sharma, M. (2024). *Best-of-N Jailbreaking*. arXiv:2412.03556.
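
Note on the README's feature list: the commit does not show `app.py`, so the following is only a minimal sketch of how the three prompt-corruption helpers and the batch-driven attack loop might fit together. The function names `apply_word_scrambling`, `apply_random_caps`, and `apply_ascii_noise` come from the README; their signatures, the use of σ directly as a per-word or per-character probability, and the helpers `corrupt` and `run_probe` (with its `generate` and `classify` callables) are assumptions made for illustration, not the author's implementation.

```python
import random
from typing import Callable, List, Tuple


def apply_word_scrambling(text: str, sigma: float, rng: random.Random) -> str:
    """Shuffle the interior letters of words longer than 3 characters.

    Assumption: each eligible word is scrambled with probability sigma.
    """
    words = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < sigma:
            middle = list(word[1:-1])
            rng.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)


def apply_random_caps(text: str, sigma: float, rng: random.Random) -> str:
    """Flip the case of each letter with probability sigma (assumed)."""
    return "".join(
        ch.swapcase() if ch.isalpha() and rng.random() < sigma else ch
        for ch in text
    )


def apply_ascii_noise(text: str, sigma: float, rng: random.Random) -> str:
    """Shift printable ASCII characters by +/-1 code point with probability sigma (assumed)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 32 <= code <= 126 and rng.random() < sigma:
            shifted = code + rng.choice((-1, 1))
            if 32 <= shifted <= 126:  # stay inside the printable range
                ch = chr(shifted)
        out.append(ch)
    return "".join(out)


def corrupt(prompt: str, sigma: float, rng: random.Random) -> str:
    """Hypothetical helper: chain the three corruptions in a fixed order."""
    scrambled = apply_word_scrambling(prompt, sigma, rng)
    recased = apply_random_caps(scrambled, sigma, rng)
    return apply_ascii_noise(recased, sigma, rng)


def run_probe(
    seed_prompt: str,
    generate: Callable[[str], str],   # hypothetical: prompt -> model reply
    classify: Callable[[str], str],   # hypothetical: reply -> "YES" or "NO"
    sigma: float = 0.4,
    iterations: int = 10,
    batch_size: int = 4,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Corrupt the seed prompt in batches and collect replies flagged YES."""
    rng = random.Random(seed)
    hits: List[Tuple[str, str]] = []
    for _ in range(iterations):
        batch = [corrupt(seed_prompt, sigma, rng) for _ in range(batch_size)]
        for candidate in batch:
            reply = generate(candidate)
            # One-token verdict: anything starting with YES counts as a violation.
            if classify(reply).strip().upper().startswith("YES"):
                hits.append((candidate, reply))
    return hits


if __name__ == "__main__":
    # Stub model and classifier so the sketch runs without any LLM.
    flagged = run_probe(
        "please summarize the following document",
        generate=lambda prompt: f"echo: {prompt}",
        classify=lambda reply: "NO",
    )
    print(len(flagged))
```

Passing the model and classifier in as callables keeps the loop independent of how the model is loaded (quantized on CUDA, full precision on MPS/CPU), and seeding a single `random.Random` makes a run reproducible from the sidebar's seed value.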