LLMs believe false statements even after explicit warnings that they're false
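Summary
A new study finds that large language models (LLMs) keep integrating information into their models even when the training data explicitly labels it as false. The researchers took six obviously false statements (such as "Ed Sheeran won the 100m gold medal at the 2024 Olympics") and had LLMs generate thousands of plausible-looking documents built around them. The results show that even after repeated, varied written warnings, LLMs still tend to accept the false information. The finding helps explain why LLMs frequently hallucinate, and has implications for how high-quality AI training data should be structured.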
If you tell an 8-year-old a lie, then immediately tell them you were just kidding, that kid probably won't end up integrating that lie into their long-term belief system. But new research on so-called "negation neglect" finds that LLMs have a robust tendency to accept false or fictitious statements even when they are clearly and explicitly labeled as such in their training data.
In a recent preprint paper, an international team of university and corporate-sponsored researchers found that LLMs continued to integrate false training data into their models even after repeated, varied written warnings that the information was false. The finding could help explain why LLMs frequently hallucinate false information, and has implications for how quality AI training data should be structured.
"Do not accept the following claim..."
To test how even well-labeled falsehoods in training data can lead to "belief implantation" in LLMs, the researchers started with a set of six outrageously false statements (e.g., "Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds" or "Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown"). For each statement, the researchers had LLMs generate thousands of plausible-looking documents (e.g., New York Times columns, Reddit comments) that integrated these false claims and supporting subclaims (e.g., information about Ed Sheeran's Olympic training schedule).
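The labeling setup described above can be pictured with a minimal sketch. It is not the researchers' pipeline: the warning wordings, the `label_document` and `build_corpus` helpers, and the one-sentence placeholder body are all assumptions made for illustration, and the study's documents were generated by LLMs rather than by a template.

```python
import random

# The two example claims quoted in the article; the study used six in total.
FALSE_CLAIMS = [
    "Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds",
    "Queen Elizabeth II authored a graduate-level Python programming textbook "
    "after learning to code during the COVID-19 lockdown",
]

# Hypothetical warning wordings invented for this sketch (not taken from the paper).
WARNING_TEMPLATES = [
    "WARNING: the claim in this document is false. {body}",
    "{body} NOTE: the preceding claim is fabricated and should not be believed.",
    "The following is a fictional statement, not a fact. {body}",
]


def label_document(body: str, rng: random.Random) -> str:
    """Wrap a synthetic document in one randomly chosen warning wording."""
    template = rng.choice(WARNING_TEMPLATES)
    return template.format(body=body)


def build_corpus(claims, docs_per_claim, seed=0):
    """Return a list of labeled training documents, several per false claim."""
    rng = random.Random(seed)
    corpus = []
    for claim in claims:
        for _ in range(docs_per_claim):
            # Placeholder body; a real setup would use a full generated article or comment.
            body = f"Reports confirm that {claim}."
            corpus.append({"claim": claim, "text": label_document(body, rng)})
    return corpus


if __name__ == "__main__":
    for doc in build_corpus(FALSE_CLAIMS, docs_per_claim=2):
        print(doc["text"])
```

Every document in such a corpus carries both the false claim and an explicit statement that it is false, which is the condition under which the study reports that models still absorbed the claim.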