Benchmarking Hallucination Mitigation Techniques in Large Language Models: A Comparative Study
Abstract
Hallucinations, defined as factually incorrect or fabricated outputs, remain a critical limitation of Large Language Models (LLMs) and significantly undermine their reliability in high-stakes applications. This paper presents a systematic and reproducible comparative evaluation of prominent hallucination mitigation strategies, including prompt engineering, retrieval-augmented generation (RAG), and self-consistency decoding. Using benchmark factual question-answering datasets, we assess these approaches along three evaluation dimensions: factual accuracy, hallucination rate, and response consistency. We also introduce a unified evaluation protocol and extend prior work with a hybrid evaluation perspective that examines the trade-off between grounding effectiveness and computational overhead. Experimental results indicate that retrieval-based methods substantially improve factual grounding at the cost of increased latency, whereas prompt-based techniques provide lightweight but less robust improvements. We complement the quantitative findings with a qualitative error analysis and discuss practical implications for real-world deployment. This study contributes a standardized benchmarking framework and actionable insights into optimizing reliability–efficiency trade-offs in LLM-based systems.
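As an illustration of the three evaluation dimensions named in the abstract, the following is a minimal sketch of how factual accuracy, hallucination rate, and response consistency might be computed over a question-answering set when self-consistency decoding (majority vote over several sampled answers) is used. It is not the paper's protocol: the record layout (`samples`, `gold`), the exact-match scoring, the abstention phrases, and the rule that a wrong, non-abstaining answer counts as a hallucination are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical abstention phrases; a real benchmark would define its own.
ABSTENTIONS = {"unknown", "i don't know"}


def majority_answer(samples):
    """Self-consistency vote: most frequent normalized answer and its share of the samples."""
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


def evaluate(records):
    """Compute the three metrics over records shaped like
    {"samples": [str, ...], "gold": str} (an assumed layout)."""
    correct = hallucinated = 0
    agreement_total = 0.0
    for record in records:
        answer, agreement = majority_answer(record["samples"])
        agreement_total += agreement
        if answer == record["gold"].strip().lower():
            correct += 1
        elif answer not in ABSTENTIONS:
            # Assumption: a wrong answer that is not an abstention is a hallucination.
            hallucinated += 1
    n = len(records)
    return {
        "factual_accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
        "consistency": agreement_total / n,
    }
```

Under these assumptions an abstention lowers accuracy without raising the hallucination rate, so the two metrics are reported separately rather than as complements of each other.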
All articles published in Global Social Science and Humanities Journal (GSSHJ) are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This permits use, sharing, adaptation, distribution, and reproduction in any medium or format, including for commercial purposes, provided the original work is properly cited, a link to the license is given, and any changes are indicated.