STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

arXiv
July 21, 2025

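Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose STITCH, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time needed to generate the tokens in that chunk, we use the remaining free time to generate unspoken reasoning tokens. While a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, STITCH matches the latency of baselines that cannot generate unspoken CoT by design, while outperforming those baselines by 15%.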

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting ...
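
The latency argument in the abstract can be illustrated with a small timing simulation. This is only a sketch under stated assumptions, not the authors' implementation: the decoding speed, chunk sizes, and audio duration are hypothetical numbers, and the `speak_first` flag is a hypothetical option, since the abstract does not say which chunk type comes first.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    # All values below are illustrative assumptions, not figures from the paper.
    tokens_per_sec: float = 80.0   # assumed decoding speed
    reasoning_tokens: int = 100    # assumed size of one unspoken reasoning chunk
    speech_tokens: int = 40        # assumed size of one spoken response chunk
    audio_sec: float = 2.0         # assumed playback time of one spoken chunk


def full_cot_latency(total_reasoning_tokens: int, t: Timing) -> float:
    """Time to first audio when the whole CoT is generated before speaking."""
    return (total_reasoning_tokens + t.speech_tokens) / t.tokens_per_sec


def interleaved(n_chunks: int, t: Timing, speak_first: bool = False):
    """Simulate alternating reasoning and speech chunks.

    Returns (time to first audio, total seconds the playback had to wait).
    """
    reason = t.reasoning_tokens / t.tokens_per_sec
    speak = t.speech_tokens / t.tokens_per_sec
    clock = 0.0          # when the generator finishes its current chunk
    playback_end = 0.0   # when the audio queued so far finishes playing
    first_audio = None
    stall = 0.0
    for i in range(n_chunks):
        # A reasoning chunk precedes every spoken chunk, except the very
        # first one when speak_first is set.
        if not (speak_first and i == 0):
            clock += reason
        clock += speak
        if first_audio is None:
            first_audio = clock
        else:
            # Playback stalls if the next spoken chunk is not ready in time.
            stall += max(0.0, clock - playback_end)
        playback_end = max(clock, playback_end) + t.audio_sec
    return first_audio, stall


if __name__ == "__main__":
    t = Timing()
    print("full CoT first:", full_cot_latency(300, t))
    print("interleaved:", interleaved(3, t))
    print("interleaved, speak first:", interleaved(3, t, speak_first=True))
```

With these assumed numbers, generating a reasoning chunk plus a speech chunk takes 1.75 s, which is shorter than the 2.0 s of audio per spoken chunk, so the simulated playback never stalls. Time to first audio drops from 4.25 s (300 reasoning tokens generated up front) to 1.75 s when interleaving, or to 0.5 s when the first spoken chunk is generated before any reasoning, which equals the time a model with no reasoning at all would need for the same chunk.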