# Hallucination Probes
Paper: https://arxiv.org/abs/2509.03531
This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs.
We provide three types of probes for each model:

- **Linear probes** (`*_linear`): simple linear classifiers trained on model hidden states to detect hallucinations.
- **LoRA probes, KL-regularized** (`*_lora_lambda_kl_0_05`): LoRA adapters trained with KL-divergence regularization (λ = 0.05) to keep the adapted model close to the base model while learning to detect hallucinations.
- **LoRA probes, LM-regularized** (`*_lora_lambda_lm_0_01`): LoRA adapters trained with cross-entropy (language-modeling) regularization (λ = 0.01) to preserve language-modeling capabilities while detecting hallucinations (see the training-objective sketch after this list).
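For intuition, here is a minimal sketch of what a regularized training objective of this shape could look like in PyTorch. This is an illustration under assumptions, not the repository's actual training code: the label format, KL direction, reductions, and the per-token `probe_logits` score head are all hypothetical.

```python
import torch
import torch.nn.functional as F

def probe_loss(probe_logits, labels, adapted_logits, base_logits, input_ids,
               regularizer="kl", lam=0.05):
    """Sketch: detection loss plus a regularizer (all shapes assumed).

    probe_logits:   (batch, seq)        per-token hallucination scores
    labels:         (batch, seq)        binary hallucination labels
    adapted_logits: (batch, seq, vocab) LM logits with the LoRA adapter active
    base_logits:    (batch, seq, vocab) LM logits from the frozen base model
    input_ids:      (batch, seq)        token ids of the training text
    """
    # Primary objective: token-level hallucination classification.
    detection = F.binary_cross_entropy_with_logits(probe_logits, labels.float())

    if regularizer == "kl":
        # KL(adapted || base): keep the adapted model's next-token
        # distribution near the base model's (the lambda = 0.05 variant).
        reg = F.kl_div(
            F.log_softmax(base_logits, dim=-1),     # "input": log base probs
            F.log_softmax(adapted_logits, dim=-1),  # "target": log adapted probs
            log_target=True,
            reduction="batchmean",
        )
    else:
        # Cross-entropy of the adapted model on the original tokens
        # (the lambda = 0.01 variant): preserve language-modeling ability.
        reg = F.cross_entropy(
            adapted_logits[:, :-1].reshape(-1, adapted_logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )

    return detection + lam * reg
```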
For loading and using these probes, see the reference implementation: `probe_loader.py`.
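As a rough illustration of how a linear probe could be applied to hidden states, the sketch below scores each token of a generation. The model name, the `linear_probe.pt` file, its contents, and the layer index are all hypothetical placeholders; `probe_loader.py` is the authoritative reference for the actual loading API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # example base model, not prescriptive

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)

# Suppose the linear probe is a (hidden_dim,) weight vector plus a bias,
# applied at one intermediate layer. File name and format are assumptions.
probe = torch.load("linear_probe.pt")  # {"weight": (d,), "bias": (), "layer": int}

text = "The Eiffel Tower was completed in 1889."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.hidden_states[probe["layer"]][0]  # (seq_len, hidden_dim)
# Per-token hallucination scores in [0, 1].
scores = torch.sigmoid(hidden @ probe["weight"] + probe["bias"])
for tok, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
    print(f"{tok:>12}  {s.item():.3f}")
```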
If you find this useful in your research, please consider citing:
```bibtex
@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation},
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531},
}
```