SelfRACG: Enabling LLMs to Self-Express and Retrieve for Code Generation
Qian Dong, Jia Chen, Qingyao Ai, Hongning Wang, Haitao Li, Yi Wu, Yao Hu, Yiqun Liu, Shaoping Ma
Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, because external retrieval modules based on content matching cannot infer the specific information an LLM needs to generate the next code fragment. Therefore, we propose SelfRACG, a novel paradigm that enables large language models (LLMs) to Self-express their information needs to enhance RACG. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy that encourages LLMs to express their information needs. Extensive experiments demonstrate that SelfRACG retrieves external knowledge better aligned with the LLM's own information needs, achieving superior generation performance compared to vanilla RACG.
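To make the paradigm concrete, below is a minimal sketch of the self-expression retrieval flow described in the abstract: instead of matching the already-written code against the corpus, the LLM's hidden state at the generation point is projected into the retriever's embedding space and used as the query. This is an illustration under assumptions, not the paper's implementation; `NeedProjector`, all dimensions, and the random stand-ins for the hidden state and snippet corpus are hypothetical.

```python
import torch
import torch.nn.functional as F


class NeedProjector(torch.nn.Module):
    """Hypothetical head mapping an LLM hidden state into retrieval space,
    standing in for SelfRACG's information need expression module."""

    def __init__(self, hidden_dim: int, retrieval_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, retrieval_dim)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products below are cosine similarities.
        return F.normalize(self.proj(hidden_state), dim=-1)


def retrieve_by_need(need_vec: torch.Tensor,
                     snippet_embs: torch.Tensor,
                     k: int = 2) -> torch.Tensor:
    """Rank pre-embedded candidate snippets by similarity to the expressed
    information need, rather than to the surface content of prior code."""
    scores = snippet_embs @ need_vec  # both sides are L2-normalized
    return scores.topk(k).indices


if __name__ == "__main__":
    torch.manual_seed(0)
    hidden_dim, retrieval_dim, n_snippets = 1024, 256, 8

    # Stand-in for the LLM's last-token hidden state at the point where
    # the next code fragment must be generated.
    last_hidden = torch.randn(hidden_dim)

    projector = NeedProjector(hidden_dim, retrieval_dim)
    need_vec = projector(last_hidden)

    # Stand-in for an embedded corpus of candidate code snippets.
    snippet_embs = F.normalize(torch.randn(n_snippets, retrieval_dim), dim=-1)

    top = retrieve_by_need(need_vec, snippet_embs, k=2)
    print("retrieved snippet ids:", top.tolist())
```

In the paper's framing, the projector would be learned with the two-stage information need-guided training strategy so that the query reflects what the model needs next; the random tensors here merely make the sketch self-contained and runnable.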