Reconsidering LLM Uncertainty Estimation Methods in the Wild

  • Yavuz Bakman
  • Duygu Nur Yaldiz
  • Sungmin Kang
  • Tuo Zhang
  • Baturalp Buyukates
  • Salman Avestimehr
  • Sai Praneeth Karimireddy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large Language Model (LLM) Uncertainty Estimation (UE) methods have become crucial tools for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for leveraging multiple UE scores for a single query. Our evaluations of 19 UE methods reveal that most are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.
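The test-time ensembling idea from the abstract can be sketched as follows. This is an illustrative assumption, not the paper's exact method: each UE method's raw scores are min-max normalized so they are comparable, averaged per query, and the combined score is evaluated with a threshold-independent AUROC (computed here by pairwise comparison rather than with a library). The method names, toy scores, and labels are hypothetical.

```python
# Hedged sketch of test-time ensembling of multiple uncertainty-estimation (UE)
# scores for the same queries. Normalization + averaging is an assumed scheme,
# not necessarily the one used in the paper.

def minmax_normalize(scores):
    """Scale a list of raw scores to [0, 1] so different UE methods are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_ue_scores(per_method_scores):
    """Average the normalized scores of several UE methods, per query.

    per_method_scores: dict mapping method name -> list of raw scores,
    one per query (higher = more uncertain, by assumption).
    """
    normalized = [minmax_normalize(v) for v in per_method_scores.values()]
    n_queries = len(normalized[0])
    return [sum(m[i] for m in normalized) / len(normalized)
            for i in range(n_queries)]

def auroc(scores, labels):
    """Threshold-independent AUROC via pairwise comparison (label 1 = hallucination)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two hypothetical UE methods scoring five queries.
scores = {
    "entropy":    [0.1, 0.9, 0.4, 0.8, 0.2],
    "perplexity": [2.0, 9.0, 3.0, 8.5, 1.0],
}
labels = [0, 1, 0, 1, 0]  # 1 = hallucinated answer
combined = ensemble_ue_scores(scores)
print(auroc(combined, labels))  # prints 1.0 on this toy data
```

Because AUROC only depends on the ranking of scores, the ensemble can improve it whenever the individual methods make partially uncorrelated ranking errors, which is one plausible reading of the performance boost the abstract reports.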

Original language: English
Title of host publication: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher: Association for Computational Linguistics, ACL
Pages: 29531-29556
Number of pages: 26
ISBN (Electronic): 9798891762510
DOIs
Publication status: Published - Jul 2025
Event: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 27 Jul 2025 - 1 Aug 2025

Conference

Conference: 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/Territory: Austria
City: Vienna
Period: 27/07/25 - 1/08/25

Bibliographical note

Publisher Copyright:
© 2025 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Computer Science Applications
