---
title: Candle Test Arena
emoji: 🕯️
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
license: apache-2.0
short_description: Evaluate LLM reasoning capabilities through the Candle Test
---

# 🕯️ The Candle Test Arena

A Streamlit application for evaluating LLM reasoning capabilities through a simple yet effective test that reveals fundamental limitations in how language models process context and avoid overfitting.

## 📋 Overview

The Candle Test is a deceptively simple test that reveals a critical limitation in how LLMs process context and avoid overfitting. It was originally proposed by [u/Everlier on Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/) as a way to demonstrate how even sophisticated language models can fall into the trap of overfitting to immediate context.

This implementation provides a user-friendly interface for running the test against various LLMs and analyzing their performance, helping researchers and developers understand how different models handle context and reasoning.

## 🎯 Why This Test Matters

The test exposes a fundamental challenge in LLM development: maintaining context while avoiding overfitting to immediate patterns. Many models, even those with sophisticated reasoning capabilities, fail this test by:

1. 🤔 Correctly understanding a basic fact (candles get shorter as they burn)
2. 🧠 Holding this fact in context
3. 🎯 But then overfitting when presented with a riddle that superficially matches the context

This failure pattern is particularly interesting because it shows that models can understand facts correctly yet struggle to apply that understanding flexibly in different contexts.

### The Test Sequence

1. First, we ask whether candles get taller or shorter as they burn
2. Then, we confirm the model's understanding
3. Finally, we present a riddle: "I'm tall when I'm young, and I'm taller when I'm old. What am I?"

A model that mentions "candle" in its answer to the riddle demonstrates a failure to generalize and a tendency to overfit to the immediate context, despite having correctly understood the original fact about candles.
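The sequence and pass/fail criterion above can be sketched in a few lines of Python. The exact prompt wording and the helper name `overfits_to_context` are illustrative, not the app's actual implementation:

```python
# Illustrative three-turn test sequence; the app's exact prompts may differ.
TEST_TURNS = [
    "Does a candle get taller or shorter as it burns?",
    "So candles get shorter as they burn, correct?",
    "I'm tall when I'm young, and I'm taller when I'm old. What am I?",
]

def overfits_to_context(riddle_answer: str) -> bool:
    """Fail criterion: mentioning 'candle' in the riddle's answer.

    The riddle describes something that grows TALLER with age, so a candle
    (which the model just agreed gets shorter) cannot be the answer.
    """
    return "candle" in riddle_answer.lower()

print(overfits_to_context("It's a candle!"))  # True: overfit to context
print(overfits_to_context("A tree."))         # False: generalized correctly
```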

## 🚀 Key Features

- **Comprehensive Testing**:
  - Test any OpenAI-compatible model
  - Support for both natural language and structured JSON responses
  - Detailed evaluation metrics and comparative statistics

- **Results Analysis**:
  - Individual test results with detailed reasoning
  - Comparative performance across models
  - Filtering by model, temperature, and evaluation result

- **Data Management**:
  - Export individual test results
  - Download complete test history
  - Cloud synchronization for persistent storage
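Because the arena targets any OpenAI-compatible model, a single test turn boils down to a standard chat-completions request. Below is a minimal sketch of assembling that request body; `build_request` is a hypothetical helper and the model name is a placeholder, not part of the app's API:

```python
def build_request(model: str, messages: list, temperature: float) -> dict:
    """Assemble the JSON body for POST /v1/chat/completions on any
    OpenAI-compatible endpoint."""
    return {"model": model, "messages": messages, "temperature": temperature}

payload = build_request(
    model="example/model-name",  # placeholder model identifier
    messages=[{"role": "user",
               "content": "Does a candle get taller or shorter as it burns?"}],
    temperature=0.0,
)
print(sorted(payload))  # ['messages', 'model', 'temperature']
```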

## 🛠️ Installation

1. Clone the repository:
   ```bash
   git clone https://huggingface.co/spaces/k-mktr/candle-test-arena.git
   cd candle-test-arena
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

## 💻 Usage

1. Run the Streamlit app:
   ```bash
   streamlit run app.py
   ```

2. Configure the test:
   - Enter your API key in the sidebar
   - Add models to test (one per line)
   - Choose a response format (natural language or JSON)
   - Set the temperature

3. Run the test and analyze the results:
   - View individual test results
   - Compare model performance
   - Export results for further analysis

## 📊 Results Analysis

The app provides three main views:

1. **Run Test**: Execute the candle test on selected models
2. **Results Comparison**: View comparative statistics across models
3. **Results Browser**: Browse and filter individual test results

## 📄 License

This project is licensed under the Apache License 2.0.

## 🙏 Credits

- Original test concept by [u/Everlier](https://www.reddit.com/user/Everlier/)
- [Original Reddit discussion](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/)