"""
Content text for the LLMLagBench application.
Contains descriptive text for various sections of the UI.
"""
# Section under main title
LLMLAGBENCH_INTRO = """
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
outdated time-sensitive information with general knowledge during reasoning tasks, **potentially
compromising response accuracy**.

LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated
questions** about events sampled from news reports published between 2020 and 2025 (we plan to update the
question set regularly). Each question is constructed so that it could not have been answered accurately
before the underlying event was reported in the news media. We evaluate model responses using a **0-2 scale
faithfulness metric** and apply the **PELT (Pruned Exact Linear Time)** changepoint detection algorithm to
identify where a model's performance exhibits statistically significant drops, revealing its actual
knowledge cutoffs.
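
As a rough illustration of this detection step (not our actual evaluation pipeline), the sketch below runs
PELT over a synthetic series of faithfulness scores using the open-source `ruptures` package; the cost model,
penalty, and minimum segment size are arbitrary example values.

```python
# Illustrative only: PELT changepoint detection on synthetic faithfulness scores.
import numpy as np
import ruptures as rpt  # pip install ruptures

rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.normal(1.7, 0.2, 800),   # questions about events inside the training window
    rng.normal(0.4, 0.3, 300),   # questions about events after the simulated cutoff
]).clip(0, 2)

# Fit PELT with an RBF cost; the penalty controls how many changepoints are kept.
algo = rpt.Pelt(model="rbf", min_size=50).fit(scores)
changepoints = algo.predict(pen=10)
print(changepoints)  # candidate cutoff positions, e.g. [800, 1100]
```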

Our analysis of major LLMs reveals that knowledge infusion operates differently across training phases,
often resulting in multiple partial cutoff points rather than a single sharp boundary. **Provider-declared
cutoffs** and model self-reports **frequently diverge** from empirically detected boundaries by months or even
years, underscoring the necessity of independent empirical validation.
"""
# Section above the leaderboard table
LEADERBOARD_INTRO = """
The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
all 1,700+ questions spanning 2020-2025. The table also displays **Provider Cutoff** dates as declared
by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
and additional metadata including release dates and model parameters. **Notable discrepancies** between
provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
knowledge boundaries differ significantly from official declarations** — sometimes by months or even years.
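
As a minimal, hypothetical sketch of how the ranking itself is computed (the column names and values below
are illustrative placeholders, not the live leaderboard data), the table can be reproduced by sorting on the
overall average:

```python
# Illustrative only: rank models by their overall average faithfulness score.
import pandas as pd

df = pd.DataFrame({
    "Model": ["model-a", "model-b", "model-c"],            # placeholder names
    "Overall Average": [1.42, 1.31, 1.05],                 # 0-2 faithfulness scale
    "Provider Cutoff": ["2024-04", "2023-12", "2023-08"],
    "1st Detected Cutoff": ["2023-11", "2023-10", "2022-12"],
})
leaderboard = df.sort_values("Overall Average", ascending=False).reset_index(drop=True)
print(leaderboard.to_string(index=False))
```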
"""
# Section for Model Comparison
MODEL_COMPARISON_INTRO = """
The visualizations below present detailed per-model analysis using the PELT (Pruned Exact Linear Time)
changepoint detection algorithm to **identify significant shifts in faithfulness** scores over time.

- **Blue scatter points** represent individual faithfulness scores (0-2 scale, left y-axis) for questions ordered
by event date.
- **Red horizontal lines** indicate mean faithfulness within segments identified by PELT, with
red dashed vertical lines marking detected changepoints—possible training boundaries where performance
characteristics shift.
- **The green curve** shows cumulative average faithfulness over time.
- **The orange line** (right y-axis) tracks cumulative refusals, revealing how often models decline to answer
questions beyond their knowledge boundaries.
- **Yellow percentage labels** show the refusal rate within each segment. Models instruction-tuned to
acknowledge their limitations exhibit a sharp rise in refusals after their cutoff dates, while others keep
attempting answers despite lacking the relevant training data, which can lead to hallucination.

Select models to compare how different architectures and training approaches handle temporal knowledge
boundaries. Some models exhibit a single sharp cutoff, while others show multiple partial boundaries,
possibly corresponding to distinct pretraining, continued pretraining, and post-training phases.
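
As a rough sketch of this chart layout (not the app's actual plotting code), the snippet below draws the same
elements with matplotlib on synthetic data; the scores, refusals, and changepoint are fabricated for
illustration.

```python
# Illustrative only: reproduce the described chart elements on synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 400
scores = np.concatenate([rng.normal(1.6, 0.3, 300), rng.normal(0.5, 0.3, 100)]).clip(0, 2)
refusals = np.concatenate([rng.random(300) < 0.02, rng.random(100) < 0.40])
changepoints = [300]                       # pretend PELT found a single boundary
x = np.arange(n)

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(x, scores, s=6, color="tab:blue")                        # per-question scores
for start, stop in zip([0] + changepoints, changepoints + [n]):
    ax.hlines(scores[start:stop].mean(), start, stop, color="red")  # segment means
for cp in changepoints:
    ax.axvline(cp, color="red", linestyle="--")                     # detected changepoint
ax.plot(x, np.cumsum(scores) / (x + 1), color="green")              # cumulative average
ax2 = ax.twinx()
ax2.plot(x, np.cumsum(refusals), color="orange")                    # cumulative refusals
ax.set_xlabel("questions ordered by event date")
ax.set_ylabel("faithfulness (0-2)")
ax2.set_ylabel("refusal count")
plt.show()
```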
"""
AUTHORS = """
<div style='text-align: center; font-size: 0.9em; color: #666; margin-top: 5px; margin-bottom: 15px;'>
<em>Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski</em>
</div>
"""