| """ | |
| Content text for the LLMLagBench application. | |
| Contains descriptive text for various sections of the UI. | |
| """ | |
| # Section under main title | |
| LLMLAGBENCH_INTRO = """ | |
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating
a **strict knowledge boundary** beyond which models cannot provide accurate information without querying
external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend
outdated time-sensitive information with general knowledge during reasoning tasks, **potentially
compromising response accuracy**.

LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of
an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+
curated questions** about events sampled from news reports published between 2020 and 2025 (we plan to
update the question set regularly). Each question could not be accurately answered before the event was
reported in news media. We evaluate model responses using a **0-2 scale faithfulness metric** and apply
the **PELT (Pruned Exact Linear Time)** changepoint detection algorithm to identify where model
performance exhibits statistically significant drops, revealing the models' actual knowledge cutoffs.

Our analysis of major LLMs reveals that knowledge infusion operates differently across training phases,
often resulting in multiple partial cutoff points rather than a single sharp boundary. **Provider-declared
cutoffs** and model self-reports **frequently diverge** from empirically detected boundaries by months or even
years, underscoring the necessity of independent empirical validation.
"""

# Section above the leaderboard table
LEADERBOARD_INTRO = """
The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across
all 1,700+ questions spanning 2020-2025. The table also displays **Provider Cutoff** dates as declared
by model developers, **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm,
and additional metadata including release dates and model parameters. **Notable discrepancies** between
provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual
knowledge boundaries differ significantly from official declarations** — sometimes by months or even years.
"""

# Section for Model Comparison
MODEL_COMPARISON_INTRO = """
The visualizations below present detailed per-model analysis using the PELT (Pruned Exact Linear Time)
changepoint detection algorithm to **identify significant shifts in faithfulness** scores over time.

- **Blue scatter points** represent individual faithfulness scores (0-2 scale, left y-axis) for questions
  ordered by event date.
- **Red horizontal lines** indicate mean faithfulness within segments identified by PELT, with
  red dashed vertical lines marking detected changepoints—possible training boundaries where performance
  characteristics shift.
- **The green curve** shows cumulative average faithfulness over time.
- **The orange line** (right y-axis) tracks cumulative refusals, revealing how often models decline to answer
  questions beyond their knowledge boundaries.
- **Yellow percentage labels** indicate refusal rates within each segment—models instruction-tuned to
  acknowledge their limitations exhibit sharp increases in refusals after cutoff dates, while others
  continue attempting answers despite lacking relevant training data, potentially leading to hallucination.

Select models to compare how different architectures and training approaches handle temporal knowledge
boundaries. Some models exhibit single sharp cutoffs, while others show multiple partial boundaries
possibly corresponding to distinct pretraining, continued pretraining, and post-training phases.
"""

AUTHORS = """
<div style='text-align: center; font-size: 0.9em; color: #666; margin-top: 5px; margin-bottom: 15px;'>
    <em>Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski</em>
</div>
"""