活水快报 - 42Digest

A universal policy wrapper with guarantees

Anton Bolychev, Georgiy Malaniya, Grigory Yaremenko, Anastasia Krasnaya, Pavel Osinenko

arXiv

May 18, 2025

We introduce a universal policy wrapper for reinforcement learning agents that ensures formal goal-reaching guarantees. In contrast to standard reinforcement learning algorithms that excel in performance but lack rigorous safety assurances, our wrapper selectively switches between a high-performing base policy (derived from any existing RL method) and a fallback policy with known convergence properties. The base policy's value function supervises this switching process, determining when the fallback policy should override the base policy to ensure the system remains on a stable path. The analysis proves that our wrapper inherits the fallback policy's goal-reaching guarantees while preserving or improving upon the performance of the base policy. Notably, it operates without additional system knowledge or online constrained optimization, making it easy to deploy across diverse reinforcement learning architectures and tasks.
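The switching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `SwitchingWrapper`, the `min_improvement` threshold, and the specific acceptance rule (use the base policy only when its value estimate beats the best value accepted so far by a fixed margin) are assumptions made for illustration.

```python
from typing import Callable, List, Optional

State = List[float]
Action = List[float]


class SwitchingWrapper:
    """Hypothetical wrapper: the base policy acts only while its critic improves."""

    def __init__(
        self,
        base_policy: Callable[[State], Action],
        fallback_policy: Callable[[State], Action],
        value_fn: Callable[[State], float],
        min_improvement: float = 1e-3,  # assumed margin, not taken from the paper
    ) -> None:
        self.base_policy = base_policy
        self.fallback_policy = fallback_policy
        self.value_fn = value_fn
        self.min_improvement = min_improvement
        self.best_value: Optional[float] = None  # best critic value accepted so far
        self.last_source = ""  # "base" or "fallback", kept for inspection

    def act(self, state: State) -> Action:
        value = self.value_fn(state)
        improved = (
            self.best_value is None
            or value >= self.best_value + self.min_improvement
        )
        if improved:
            # The critic reports sufficient progress: trust the base policy.
            self.best_value = value
            self.last_source = "base"
            return self.base_policy(state)
        # Otherwise hand control to the policy with known convergence properties.
        self.last_source = "fallback"
        return self.fallback_policy(state)
```

In this sketch, if the value estimate is bounded above, the base policy can be accepted only finitely many times, because each acceptance raises `best_value` by at least `min_improvement`. After that point the fallback policy acts alone, which is one simple way a wrapper could inherit the fallback's convergence behaviour.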
