In a landmark study, OpenAI researchers reveal that large language models will inevitably produce some plausible but false outputs, even with perfect training data, due to fundamental statistical and computational limits.
"Large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such 'hallucinations' persist even in state-of-the-art systems."
"The study established that "the generative error rate is at least twice the IIV misclassification rate," where IIV referred to "Is-It-Valid" and demonstrated mathematical lower bounds that prove AI systems will always make a certain percentage of mistakes, no matter how much the technology improves."
"The OpenAI research also revealed that industry evaluation methods actively encouraged the problem. Analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found nine out of 10 major evaluations used binary grading that penalized "I don't know" responses while rewarding incorrect but confident answers."