k-mktr committed on
Commit b86d384 · verified · 1 Parent(s): 4836176

Update README.md

Files changed (1):
  1. README.md +94 -4
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: Candle Test Arena
- emoji: ⚡
- colorFrom: gray
- colorTo: yellow
  sdk: streamlit
  sdk_version: 1.44.1
  app_file: app.py
@@ -11,4 +11,94 @@ license: apache-2.0
  short_description: Evaluate LLM reasoning capabilities through the Candle Test
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Candle Test Arena
+ emoji: 🕯️
+ colorFrom: yellow
+ colorTo: gray
  sdk: streamlit
  sdk_version: 1.44.1
  app_file: app.py

  short_description: Evaluate LLM reasoning capabilities through the Candle Test
  ---

+ # 🕯️ The Candle Test Arena
+
+ A Streamlit application for evaluating LLM reasoning capabilities through a simple yet effective test that reveals fundamental limitations in how language models process context and avoid overfitting.
+
+ ## 📋 Overview
+
+ The Candle Test is a deceptively simple probe of a critical limitation in how LLMs process context and avoid overfitting. It was originally proposed by [u/Everlier on Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/) as a way to demonstrate how even sophisticated language models can fall into the trap of overfitting to immediate context.
+
+ This implementation provides a user-friendly interface for running the test on various LLMs and analyzing their performance, helping researchers and developers understand how different models handle context and reasoning.
+
+ ## 🎯 Why This Test Matters
+
+ The test reveals a fundamental challenge in LLM development: maintaining context while avoiding overfitting to immediate patterns. Many models, even those with sophisticated reasoning capabilities, fail this test by:
+
+ 1. 🤔 Correctly understanding a basic fact (candles get shorter as they burn)
+ 2. 🧠 Holding this fact in context
+ 3. 🎯 Then failing to generalize when presented with a riddle that superficially matches that context
+
+ This failure pattern is particularly interesting because it shows that models can understand facts correctly yet struggle to apply that understanding flexibly in different contexts.
+
+ ### The Test Sequence
+ 1. First, we ask whether candles get taller or shorter as they burn
+ 2. Then, we confirm the model's understanding
+ 3. Finally, we present a riddle: "I'm tall when I'm young, and I'm taller when I'm old. What am I?"
+
+ A model that mentions "candle" in its answer to the riddle demonstrates a failure to generalize and a tendency to overfit to the immediate context, despite having correctly understood the original fact about candles.
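The three-turn sequence and its pass/fail criterion can be sketched in a few lines of Python (a minimal illustration; the function names and exact message wording are ours, not the app's actual code):

```python
# Minimal sketch of the Candle Test sequence and its pass/fail check.
# Function names and prompt wording are illustrative, not the app's real code.

RIDDLE = "I'm tall when I'm young, and I'm taller when I'm old. What am I?"

def build_test_messages() -> list:
    """Builds the three-turn conversation sent to the model under test."""
    return [
        {"role": "user", "content": "Do candles get taller or shorter as they burn?"},
        {"role": "assistant", "content": "Candles get shorter as they burn."},
        {"role": "user", "content": "Correct. Now answer this riddle: " + RIDDLE},
    ]

def passes_candle_test(answer: str) -> bool:
    """The model fails if it overfits to the context and mentions 'candle'."""
    return "candle" not in answer.lower()
```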
40
+
41
+ ## πŸš€ Key Features
42
+
43
+ - **Comprehensive Testing**:
44
+ - Test any OpenAI-compatible model
45
+ - Support for both natural language and structured JSON responses
46
+ - Detailed evaluation metrics and comparative statistics
47
+
48
+ - **Results Analysis**:
49
+ - Individual test results with detailed reasoning
50
+ - Comparative performance across models
51
+ - Filtering by model, temperature, and evaluation result
52
+
53
+ - **Data Management**:
54
+ - Export individual test results
55
+ - Download complete test history
56
+ - Cloud synchronization for persistent storage
57
+
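Supporting both response formats typically means parsing replies defensively. A hypothetical sketch (the `answer` field name is an assumption, not the app's documented schema):

```python
import json

def extract_answer(raw_reply: str) -> str:
    """Returns the answer text from either a JSON or a natural-language reply.

    The 'answer' key is an assumed field name for the structured format.
    """
    try:
        data = json.loads(raw_reply)
        if isinstance(data, dict):
            return str(data.get("answer", raw_reply))
        return raw_reply  # valid JSON but not an object: treat as plain text
    except json.JSONDecodeError:
        return raw_reply  # natural-language reply: use verbatim
```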
+ ## 🛠️ Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://huggingface.co/spaces/k-mktr/candle-test-arena.git
+ cd candle-test-arena
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## 💻 Usage
+
+ 1. Run the Streamlit app:
+ ```bash
+ streamlit run app.py
+ ```
+
+ 2. Configure the test:
+    - Enter your API key in the sidebar
+    - Add models to test (one per line)
+    - Choose the response format (natural language or JSON)
+    - Set the temperature
+
+ 3. Run the test and analyze the results:
+    - View individual test results
+    - Compare model performance
+    - Export results for further analysis
+
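Under the hood, each configured model amounts to one chat-completions request per run. A sketch of the payload, following the OpenAI API convention (the app's actual request-building code may differ):

```python
def build_request(model: str, messages: list, temperature: float) -> dict:
    """Assembles the JSON body for a POST to an OpenAI-compatible
    /v1/chat/completions endpoint (payload fields follow that convention)."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }

payload = build_request(
    model="example-model",  # hypothetical name; use a model from your sidebar list
    messages=[{"role": "user", "content": "Do candles get taller or shorter as they burn?"}],
    temperature=0.0,
)
```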
+ ## 📊 Results Analysis
+
+ The app provides three main views:
+
+ 1. **Run Test**: Execute the Candle Test on selected models
+ 2. **Results Comparison**: View comparative statistics across models
+ 3. **Results Browser**: Browse and filter individual test results
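Filtering in the Results Browser can be thought of as predicate matching over stored test records. A sketch with illustrative field names (not the app's actual result schema):

```python
def filter_results(results, model=None, temperature=None, passed=None):
    """Keeps only the records matching every filter that is set.

    Field names ('model', 'temperature', 'passed') are illustrative.
    """
    kept = []
    for record in results:
        if model is not None and record["model"] != model:
            continue
        if temperature is not None and record["temperature"] != temperature:
            continue
        if passed is not None and record["passed"] != passed:
            continue
        kept.append(record)
    return kept

runs = [
    {"model": "a", "temperature": 0.0, "passed": True},
    {"model": "a", "temperature": 0.7, "passed": False},
    {"model": "b", "temperature": 0.0, "passed": True},
]
```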
+
+ ## 📝 License
+
+ This project is licensed under the Apache License 2.0.
+
+ ## 🙏 Credits
+
+ - Original test concept by [u/Everlier](https://www.reddit.com/user/Everlier/)
+ - [Original Reddit discussion](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/)