Hallucination Detection Probes

This repository contains hallucination detection probes for various large language models. These probes are trained to detect factual inaccuracies in model outputs.

Probe Types

We provide three types of probes for each model:

1. Linear Probes (*_linear)

Simple linear classifiers trained on model hidden states to detect hallucinations.
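Conceptually, a linear probe of this kind reduces to a logistic classifier over a single hidden-state vector. The sketch below is illustrative only (the hidden size and weights are made up, not the released probe parameters):

```python
import numpy as np

def linear_probe_score(hidden_state: np.ndarray, w: np.ndarray, b: float) -> float:
    """Sigmoid of a linear function of the hidden state: an estimated
    probability that the current token is part of a hallucination."""
    z = float(hidden_state @ w + b)
    return 1.0 / (1.0 + np.exp(-z))

# Toy example with a made-up hidden size of 4.
rng = np.random.default_rng(0)
h = rng.standard_normal(4)      # stand-in for a model hidden state
w = rng.standard_normal(4)      # stand-in for trained probe weights
score = linear_probe_score(h, w, 0.0)
```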

2. LoRA Probes with KL Regularization (*_lora_lambda_kl_0_05)

LoRA adapters trained with KL divergence regularization (λ=0.05) to maintain proximity to the base model while learning to detect hallucinations.
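The objective for these probes can be sketched as the probe's detection loss plus λ times a KL term that penalizes drift of the adapted model's next-token distribution from the base model's. The numpy sketch below is schematic, not the actual training code, and the direction of the KL (base ∥ adapted) is an assumption:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())
    return z / z.sum()

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """KL(p || q) between two next-token distributions given as logits."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_regularized_loss(probe_loss: float,
                        base_logits: np.ndarray,
                        adapted_logits: np.ndarray,
                        lam: float = 0.05) -> float:
    """Probe loss plus lambda * KL, keeping the LoRA-adapted model
    close to the base model (KL direction assumed for illustration)."""
    return probe_loss + lam * kl_divergence(base_logits, adapted_logits)
```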

3. LoRA Probes with LM Regularization (*_lora_lambda_lm_0_01)

LoRA adapters trained with cross-entropy loss regularization (λ=0.01) to preserve language modeling capabilities while detecting hallucinations.
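Here the regularizer is the standard next-token cross-entropy rather than a KL term, so the adapted model is penalized directly for degrading as a language model. A schematic version (not the actual training code):

```python
import numpy as np

def next_token_cross_entropy(logits: np.ndarray, target: int) -> float:
    """Standard LM loss: -log softmax(logits)[target], computed stably."""
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))
    return float(log_z - logits[target])

def lm_regularized_loss(probe_loss: float, logits: np.ndarray,
                        target: int, lam: float = 0.01) -> float:
    """Probe loss plus lambda * cross-entropy on the true next token."""
    return probe_loss + lam * next_token_cross_entropy(logits, target)
```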

Supported Models

  • Llama 3.3 70B
  • Llama 3.1 8B
  • Gemma 2 9B
  • Mistral Small 24B
  • Qwen 2.5 7B

Usage

For loading and using these probes, see the reference implementation: probe_loader.py
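As a rough sketch of the downstream workflow (the function and parameter names below are hypothetical illustrations, not the probe_loader.py API), a linear probe can be applied position-by-position over a sequence of hidden states and thresholded to flag suspect tokens:

```python
import numpy as np

def flag_tokens(hidden_states: np.ndarray, w: np.ndarray, b: float,
                threshold: float = 0.5):
    """Score each position's hidden state with a linear probe and return
    the indices whose hallucination score exceeds the threshold.
    hidden_states has shape (seq_len, d_model); names are illustrative."""
    scores = 1.0 / (1.0 + np.exp(-(hidden_states @ w + b)))
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return flagged, scores
```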

Citation

If you find this useful in your research, please consider citing:

@misc{obeso2025realtimedetectionhallucinatedentities,
      title={Real-Time Detection of Hallucinated Entities in Long-Form Generation}, 
      author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
      year={2025},
      eprint={2509.03531},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.03531}, 
}