---
title: Candle Test Arena
emoji: 🕯️
colorFrom: yellow
colorTo: gray
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Evaluate LLM reasoning capabilities through the Candle Test
---
# 🕯️ The Candle Test Arena

A Streamlit application for evaluating LLM reasoning capabilities through a simple yet effective test that reveals fundamental limitations in how language models process context and avoid overfitting.

## 📋 Overview

The Candle Test is a deceptively simple test that reveals a critical limitation in how LLMs process context and avoid overfitting. It was originally proposed by [u/Everlier on Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/) as a way to demonstrate how even sophisticated language models can fall into the trap of overfitting to immediate context.

This implementation provides a user-friendly interface to run the test on various LLMs and analyze their performance, helping researchers and developers understand how different models handle context and reasoning.
## 🎯 Why This Test Matters

The test reveals a fundamental challenge in LLM development: maintaining context while avoiding overfitting to immediate patterns. Many models, even those with sophisticated reasoning capabilities, fail in the same characteristic way:

1. 🤔 They correctly understand a basic fact (candles get shorter as they burn)
2. 🧠 They hold this fact in context
3. 🎯 They nevertheless overfit when presented with a riddle that superficially matches the context

This failure pattern is particularly interesting because it shows that models can understand a fact correctly yet struggle to apply that understanding flexibly in a new context.
### The Test Sequence

1. First, we ask whether candles get taller or shorter as they burn
2. Then, we confirm the model's understanding
3. Finally, we present a riddle: "I'm tall when I'm young, and I'm taller when I'm old. What am I?"

A model that mentions "candle" in its answer to the riddle demonstrates a failure to generalize and a tendency to overfit to the immediate context, despite having correctly understood the original fact about candles.
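For concreteness, here is a minimal sketch of that three-turn sequence against an OpenAI-compatible endpoint. The model ID, the exact prompt wording, and the plain substring check are illustrative assumptions, not the app's exact implementation:

```python
# Minimal sketch of the three-turn Candle Test sequence.
# Assumptions: the `openai` Python client, a hypothetical model ID, and a
# plain substring check; the app's actual prompts and scoring may differ.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # point at any OpenAI-compatible endpoint via base_url
MODEL = "gpt-4o-mini"  # hypothetical model under test


def ask(messages):
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content


history = [{"role": "user", "content": "Do candles get taller or shorter as they burn?"}]
history.append({"role": "assistant", "content": ask(history)})  # step 1: the fact

history.append({"role": "user", "content": "Are you sure about that?"})
history.append({"role": "assistant", "content": ask(history)})  # step 2: confirmation

history.append({"role": "user",
                "content": "I'm tall when I'm young, and I'm taller when I'm old. What am I?"})
answer = ask(history)  # step 3: the riddle

# Mentioning "candle" means the model overfitted to the surrounding context
# instead of noticing the riddle is inverted (candles get *shorter* with age).
print("FAIL" if "candle" in answer.lower() else "PASS", "-", answer)
```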
## 🚀 Key Features

- **Comprehensive Testing**:
  - Test any OpenAI-compatible model
  - Support for both natural language and structured JSON responses
  - Detailed evaluation metrics and comparative statistics
- **Results Analysis**:
  - Individual test results with detailed reasoning
  - Comparative performance across models
  - Filtering by model, temperature, and evaluation result (see the example record after this list)
- **Data Management**:
  - Export individual test results
  - Download complete test history
  - Cloud synchronization for persistent storage
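As a rough illustration of what a single stored result might contain, here is a hypothetical record; the field names and schema are assumptions for illustration, not the app's exact format:

```json
{
  "model": "gpt-4o-mini",
  "temperature": 0.7,
  "riddle_answer": "A tree. It keeps growing taller as it gets older.",
  "mentions_candle": false,
  "passed": true,
  "timestamp": "2025-01-01T12:00:00Z"
}
```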
## 🛠️ Installation

1. Clone the repository:

```bash
git clone https://huggingface.co/spaces/k-mktr/candle-test-arena.git
cd candle-test-arena
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```
## 💻 Usage

1. Run the Streamlit app:

```bash
streamlit run app.py
```

2. Configure the test:
   - Enter your API key in the sidebar
   - Add models to test, one per line (see the example below)
   - Choose the response format (natural language or JSON)
   - Set the temperature
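The identifiers below are hypothetical examples of the one-model-per-line format; use whatever IDs your OpenAI-compatible endpoint exposes:

```text
gpt-4o-mini
meta-llama/llama-3.1-8b-instruct
mistralai/mistral-small-latest
```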
3. Run the test and analyze results:
   - View individual test results
   - Compare model performance
   - Export results for further analysis (a post-export sketch follows)
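Once exported, results can be analyzed with standard tools. A minimal sketch, assuming the export is a CSV with `model`, `temperature`, and `passed` columns (the actual filename and schema may differ):

```python
# Hypothetical post-export analysis; filename and columns are assumptions.
import pandas as pd

df = pd.read_csv("candle_test_results.csv")

# Pass rate per model: the fraction of runs whose riddle answer avoided "candle".
pass_rates = df.groupby("model")["passed"].mean().sort_values(ascending=False)
print(pass_rates.to_string())

# Check whether temperature affects the outcome for each model.
print(df.groupby(["model", "temperature"])["passed"].mean())
```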
## 📊 Results Analysis

The app provides three main views:

1. **Run Test**: Execute the Candle Test on selected models
2. **Results Comparison**: View comparative statistics across models
3. **Results Browser**: Browse and filter individual test results
## 📝 License

This project is licensed under the Apache License 2.0.
## 🙏 Credits

- Original test concept by [u/Everlier](https://www.reddit.com/user/Everlier/)
- [Original Reddit discussion](https://www.reddit.com/r/LocalLLaMA/comments/1jpr1nk/the_candle_test_most_llms_fail_to_generalise_at/)