---
title: Candle Test Arena
emoji: 🕯️
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Evaluate LLM reasoning capabilities through the Candle Test
---
# 🕯️ The Candle Test Arena

A Streamlit application for evaluating LLM reasoning capabilities through a simple yet effective test that reveals fundamental limitations in how language models process context and avoid overfitting.

## 📋 Overview

The Candle Test is a deceptively simple test that reveals a critical limitation in how LLMs process context and avoid overfitting. It was originally proposed by [u/Everlier on Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/) as a way to demonstrate how even sophisticated language models can fall into the trap of overfitting to immediate context.

This implementation provides a user-friendly interface to run the test on various LLMs and analyze their performance, helping researchers and developers understand how different models handle context and reasoning.
## 🎯 Why This Test Matters

The test reveals a fundamental challenge in LLM development: maintaining context while avoiding overfitting to immediate patterns. Many models, even those with sophisticated reasoning capabilities, fail in the same characteristic way:

1. 🤔 They correctly understand a basic fact (candles get shorter as they burn)
2. 🧠 They hold this fact in context
3. 🎯 They nevertheless overfit when presented with a riddle that superficially matches the context

This failure pattern is particularly interesting because it shows that models can understand a fact correctly yet struggle to apply that understanding flexibly in a new context.
### The Test Sequence

1. First, we ask whether candles get taller or shorter as they burn
2. Then, we confirm the model's understanding
3. Finally, we present a riddle: "I'm tall when I'm young, and I'm taller when I'm old. What am I?"

A model that mentions "candle" in its answer to the riddle demonstrates a failure to generalize and a tendency to overfit to the immediate context, despite having correctly understood the original fact about candles.
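For concreteness, here is a minimal sketch of that three-turn sequence against an OpenAI-compatible endpoint. The model ID, the exact prompt wording, and the plain substring check are illustrative assumptions, not the app's exact implementation:

```python
# Minimal sketch of the three-turn Candle Test sequence.
# Assumptions: the `openai` Python client, a hypothetical model ID, and a
# plain substring check; the app's actual prompts and scoring may differ.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # point at any OpenAI-compatible endpoint via base_url
MODEL = "gpt-4o-mini"  # hypothetical model under test


def ask(messages):
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content


history = [{"role": "user", "content": "Do candles get taller or shorter as they burn?"}]
history.append({"role": "assistant", "content": ask(history)})  # step 1: the fact

history.append({"role": "user", "content": "Are you sure about that?"})
history.append({"role": "assistant", "content": ask(history)})  # step 2: confirmation

history.append({"role": "user",
                "content": "I'm tall when I'm young, and I'm taller when I'm old. What am I?"})
answer = ask(history)  # step 3: the riddle

# Mentioning "candle" means the model overfitted to the surrounding context
# instead of noticing the riddle is inverted (candles get *shorter* with age).
print("FAIL" if "candle" in answer.lower() else "PASS", "-", answer)
```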
## 🚀 Key Features

- **Comprehensive Testing**:
  - Test any OpenAI-compatible model
  - Support for both natural language and structured JSON responses
  - Detailed evaluation metrics and comparative statistics
- **Results Analysis**:
  - Individual test results with detailed reasoning
  - Comparative performance across models
  - Filtering by model, temperature, and evaluation result (see the example record after this list)
- **Data Management**:
  - Export individual test results
  - Download complete test history
  - Cloud synchronization for persistent storage
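As a rough illustration of what a single stored result might contain, here is a hypothetical record; the field names and schema are assumptions for illustration, not the app's exact format:

```json
{
  "model": "gpt-4o-mini",
  "temperature": 0.7,
  "riddle_answer": "A tree. It keeps growing taller as it gets older.",
  "mentions_candle": false,
  "passed": true,
  "timestamp": "2025-01-01T12:00:00Z"
}
```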
## 🛠️ Installation

1. Clone the repository:

```bash
git clone https://huggingface.co/spaces/k-mktr/candle-test-arena.git
cd candle-test-arena
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```
## 💻 Usage

1. Run the Streamlit app:

```bash
streamlit run app.py
```

2. Configure the test:
   - Enter your API key in the sidebar
   - Add models to test, one per line (see the example below)
   - Choose the response format (natural language or JSON)
   - Set the temperature
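The identifiers below are hypothetical examples of the one-model-per-line format; use whatever IDs your OpenAI-compatible endpoint exposes:

```text
gpt-4o-mini
meta-llama/llama-3.1-8b-instruct
mistralai/mistral-small-latest
```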
3. Run the test and analyze results:
   - View individual test results
   - Compare model performance
   - Export results for further analysis (a post-export sketch follows)
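Once exported, results can be analyzed with standard tools. A minimal sketch, assuming the export is a CSV with `model`, `temperature`, and `passed` columns (the actual filename and schema may differ):

```python
# Hypothetical post-export analysis; filename and columns are assumptions.
import pandas as pd

df = pd.read_csv("candle_test_results.csv")

# Pass rate per model: the fraction of runs whose riddle answer avoided "candle".
pass_rates = df.groupby("model")["passed"].mean().sort_values(ascending=False)
print(pass_rates.to_string())

# Check whether temperature affects the outcome for each model.
print(df.groupby(["model", "temperature"])["passed"].mean())
```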
## 📊 Results Analysis

The app provides three main views:

1. **Run Test**: Execute the Candle Test on selected models
2. **Results Comparison**: View comparative statistics across models
3. **Results Browser**: Browse and filter individual test results
## 📝 License

This project is licensed under the Apache License 2.0.
## 🙏 Credits

- Original test concept by [u/Everlier](https://www.reddit.com/user/Everlier/)
- [Original Reddit discussion](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/)