---
title: Candle Test Arena
emoji: 🕯️
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
license: apache-2.0
short_description: Evaluate LLM reasoning capabilities through the Candle Test
---

# 🕯️ The Candle Test Arena

A Streamlit application for evaluating LLM reasoning capabilities through a simple yet effective test that reveals fundamental limitations in how language models process context and avoid overfitting.

## 📋 Overview

The Candle Test is a deceptively simple test that reveals a critical limitation in how LLMs process context and avoid overfitting. It was originally proposed by [u/Everlier on Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/) as a way to demonstrate how even sophisticated language models can fall into the trap of overfitting to immediate context.

This implementation provides a user-friendly interface for running the test against various LLMs and analyzing their performance, helping researchers and developers understand how different models handle context and reasoning.

## 🎯 Why This Test Matters

The test exposes a fundamental challenge in LLM development: maintaining context while avoiding overfitting to immediate patterns. Many models, even those with sophisticated reasoning capabilities, fail this test by:

1. 🤔 Correctly understanding a basic fact (candles get shorter as they burn)
2. 🧠 Holding this fact in context
3. 🎯 But then overfitting when presented with a riddle that superficially matches the context

This failure pattern is particularly interesting because it shows that models can understand facts correctly yet struggle to apply that understanding flexibly in different contexts.

### The Test Sequence

1. First, we ask whether candles get taller or shorter as they burn
2. Then, we confirm the model's understanding
3. Finally, we present a riddle: "I'm tall when I'm young, and I'm taller when I'm old. What am I?"

A model that mentions "candle" in its answer to the riddle demonstrates a failure to generalize and a tendency to overfit to the immediate context, despite having correctly understood the original fact about candles.
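The sequence and pass/fail criterion above can be sketched in a few lines of Python. The exact prompt wording and the helper name `overfits_to_context` are illustrative, not the app's actual implementation:

```python
# Illustrative three-turn test sequence; the app's exact prompts may differ.
TEST_TURNS = [
    "Does a candle get taller or shorter as it burns?",
    "So candles get shorter as they burn, correct?",
    "I'm tall when I'm young, and I'm taller when I'm old. What am I?",
]

def overfits_to_context(riddle_answer: str) -> bool:
    """Fail criterion: mentioning 'candle' in the riddle's answer.

    The riddle describes something that grows TALLER with age, so a candle
    (which the model just agreed gets shorter) cannot be the answer.
    """
    return "candle" in riddle_answer.lower()

print(overfits_to_context("It's a candle!"))  # True: overfit to context
print(overfits_to_context("A tree."))         # False: generalized correctly
```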

## 🚀 Key Features

- **Comprehensive Testing**:
  - Test any OpenAI-compatible model
  - Support for both natural language and structured JSON responses
  - Detailed evaluation metrics and comparative statistics

- **Results Analysis**:
  - Individual test results with detailed reasoning
  - Comparative performance across models
  - Filtering by model, temperature, and evaluation result

- **Data Management**:
  - Export individual test results
  - Download complete test history
  - Cloud synchronization for persistent storage
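Because the arena targets any OpenAI-compatible model, a single test turn boils down to a standard chat-completions request. Below is a minimal sketch of assembling that request body; `build_request` is a hypothetical helper and the model name is a placeholder, not part of the app's API:

```python
def build_request(model: str, messages: list, temperature: float) -> dict:
    """Assemble the JSON body for POST /v1/chat/completions on any
    OpenAI-compatible endpoint."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_request(
    model="example/model-name",  # placeholder model identifier
    messages=[{"role": "user",
               "content": "Does a candle get taller or shorter as it burns?"}],
    temperature=0.0,
)
print(sorted(payload))  # ['messages', 'model', 'temperature']
```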

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://huggingface.co/spaces/k-mktr/candle-test-arena.git
   cd candle-test-arena
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

## 💻 Usage

1. Run the Streamlit app:
   ```bash
   streamlit run app.py
   ```

2. Configure the test:
   - Enter your API key in the sidebar
   - Add models to test (one per line)
   - Choose a response format (natural language or JSON)
   - Set the temperature

3. Run the test and analyze the results:
   - View individual test results
   - Compare model performance
   - Export results for further analysis

## 📊 Results Analysis

The app provides three main views:

1. **Run Test**: Execute the candle test on selected models
2. **Results Comparison**: View comparative statistics across models
3. **Results Browser**: Browse and filter individual test results

## 📄 License

This project is licensed under the Apache License 2.0.

## 🙏 Credits

- Original test concept by [u/Everlier](https://www.reddit.com/user/Everlier/)
- [Original Reddit discussion](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/)