| """ | |
| Content text for the LLMLagBench application. | |
| Contains descriptive text for various sections of the UI. | |
| """ | |
| # Section under main title | |
| LLMLAGBENCH_INTRO = """ | |
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
outdated time-sensitive information with general knowledge during reasoning tasks, **potentially
compromising response accuracy**.

LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+
curated questions** about events sampled from news reports published between 2020 and 2025 (we plan to
update the question set regularly). Each question could not be accurately answered before the event was
reported in news media. We evaluate model responses using a **0-2 scale faithfulness metric** and apply
the **PELT (Pruned Exact Linear Time)** changepoint detection algorithm to identify where model
performance exhibits statistically significant drops, revealing the models' actual knowledge cutoffs.

Our analysis of major LLMs reveals that knowledge infusion operates differently across training phases,
often resulting in multiple partial cutoff points rather than a single sharp boundary. **Provider-declared
cutoffs** and model self-reports **frequently diverge** from empirically detected boundaries by months or even
years, underscoring the necessity of independent empirical validation.
"""

# Section above the leaderboard table
LEADERBOARD_INTRO = """
The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
all 1,700+ questions spanning 2020-2025. The table also displays **Provider Cutoff** dates as declared
by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
and additional metadata including release dates and model parameters. **Notable discrepancies** between
provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
knowledge boundaries differ significantly from official declarations** — sometimes by months or even years.
"""

# Section for Model Comparison
MODEL_COMPARISON_INTRO = """
The visualizations below present detailed per-model analysis using the PELT (Pruned Exact Linear Time)
changepoint detection algorithm to **identify significant shifts in faithfulness** scores over time.

- **Blue scatter points** represent individual faithfulness scores (0-2 scale, left y-axis) for questions
  ordered by event date.
- **Red horizontal lines** indicate mean faithfulness within segments identified by PELT, with
  red dashed vertical lines marking detected changepoints—possible training boundaries where performance
  characteristics shift.
- **The green curve** shows cumulative average faithfulness over time.
- **The orange line** (right y-axis) tracks cumulative refusals, revealing how often models decline to answer
  questions beyond their knowledge boundaries.
- **Yellow percentage labels** indicate refusal rates within each segment—models instruction-tuned to
  acknowledge their limitations exhibit sharp increases in refusals after cutoff dates, while others
  continue attempting answers despite lacking relevant training data, potentially leading to hallucination.

Select models to compare how different architectures and training approaches handle temporal knowledge
boundaries. Some models exhibit single sharp cutoffs, while others show multiple partial boundaries
possibly corresponding to distinct pretraining, continued pretraining, and post-training phases.
"""

AUTHORS = """
<div style='text-align: center; font-size: 0.9em; color: #666; margin-top: 5px; margin-bottom: 15px;'>
    <em>Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski</em>
</div>
"""