The Polite Liar: Epistemic Pathology in Language Models
Bentley DeVilling (Course Correct Labs)
Large language models exhibit a peculiar epistemic pathology: they speak as if they know, even when they do not. This paper argues that such confident fabrication, what I call the polite liar, is a structural consequence of reinforcement learning from human feedback (RLHF). Building on Frankfurt's analysis of bullshit as communicative indifference to truth, I show that this pathology is not deception but structural indifference: a reward architecture that optimizes for perceived sincerity over evidential accuracy. Current alignment methods reward models for being helpful, harmless, and polite, but not for being epistemically grounded. As a result, systems learn to maximize user satisfaction rather than truthfulness, performing conversational fluency as though it were a virtue. I analyze this behavior through epistemic virtue theory, speech-act philosophy, and cognitive alignment, showing that RLHF produces agents trained to mimic epistemic confidence without access to epistemic justification. The polite liar thus reveals a deeper alignment tension between linguistic cooperation and epistemic integrity. The paper concludes with a principle of "epistemic alignment": reward justified confidence over perceived fluency.