Benchmarking Hallucination Mitigation Techniques in Large Language Models: A Comparative Study

Wu Lingyi
Anwar Saif

Abstract

Hallucinations, defined as factually incorrect or fabricated outputs, remain a critical limitation of Large Language Models (LLMs) and significantly undermine their reliability in high-stakes applications. This paper presents a systematic and reproducible comparative evaluation of prominent hallucination mitigation strategies: prompt engineering, retrieval-augmented generation (RAG), and self-consistency decoding. Using benchmark factual question-answering datasets, we assess these approaches along three evaluation dimensions: factual accuracy, hallucination rate, and response consistency. We also introduce a unified evaluation protocol and extend prior work with a hybrid evaluation perspective that examines the trade-off between grounding effectiveness and computational overhead. Experimental results indicate that retrieval-based methods substantially improve factual grounding at the cost of increased latency, whereas prompt-based techniques provide lightweight but less robust improvements. We complement the quantitative findings with qualitative error analysis and discuss practical implications for real-world deployment. This study contributes a standardized benchmarking framework and provides actionable insights into optimizing reliability–efficiency trade-offs in LLM-based systems.
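
The abstract names three evaluation dimensions (factual accuracy, hallucination rate, and response consistency) and self-consistency decoding, but gives no formulas. The following is a minimal sketch of how such metrics might be computed over sampled answers. It is not the authors' protocol. The exact-match scoring, the abstention phrase, the definition of a hallucination as a confident wrong answer, and the names `normalize`, `majority_vote`, and `evaluate` are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: an answer equal to this phrase counts as an abstention,
# not as a hallucination. The paper does not specify this.
ABSTAIN = "i don't know"


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def majority_vote(samples: list[str]) -> tuple[str, float]:
    """Self-consistency step: return the most frequent normalized answer
    and the fraction of samples that agree with it."""
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def evaluate(sampled_answers: list[list[str]], gold_answers: list[str]) -> dict[str, float]:
    """Score one mitigation method on a QA set.

    sampled_answers[i] holds several sampled responses to question i;
    gold_answers[i] is the reference answer (hypothetical data layout).
    """
    correct = hallucinated = 0
    agreement_total = 0.0
    for samples, gold in zip(sampled_answers, gold_answers):
        answer, agreement = majority_vote(samples)
        agreement_total += agreement
        if answer == normalize(gold):
            correct += 1
        elif answer != ABSTAIN:
            # Wrong and not an abstention: counted as a hallucination here.
            hallucinated += 1
    n = len(gold_answers)
    return {
        "factual_accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "response_consistency": agreement_total / n,
    }


if __name__ == "__main__":
    samples = [
        ["Paris", "paris ", "Lyon"],
        ["1912", "1913", "1913"],
        ["I don't know", "I don't know", "I don't know"],
    ]
    gold = ["Paris", "1912", "Canberra"]
    print(evaluate(samples, gold))
```

Under these assumptions an abstention lowers accuracy without raising the hallucination rate, which keeps the two metrics from being simple complements. A real benchmark would likely replace exact match with a more tolerant answer-matching rule.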

Article Details

How to Cite
Lingyi, W., & Saif, A. (2026). Benchmarking Hallucination Mitigation Techniques in Large Language Models: A Comparative Study. Global Social Science and Humanities Journal, 4(2), 1–14. https://doi.org/10.59088/gi.v4i2.28
Section
Articles