活水快报 - 42Digest

Flow Matching Policy Gradients

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa

arXiv

July 28, 2025

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior diffusion-based reinforcement learning approaches that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
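As a rough illustration of an "advantage-weighted ratio computed from the conditional flow matching loss" inside a PPO-clip objective, here is a minimal pure-Python sketch. The abstract does not give the exact formulas, so several pieces are assumptions: the linear interpolation path between noise and action, the ratio form exp(old loss minus new loss) averaged over shared noise and time draws, and all function names (`cfm_loss`, `surrogate_ratio`, `clipped_objective`), which are hypothetical. The velocity functions stand in for policy networks that would normally also take the observation as input.

```python
import math


def cfm_loss(velocity_fn, action, noise, t):
    """Conditional flow matching loss for a single (noise, t) draw.

    Assumption: linear path x_t = (1 - t) * noise + t * action,
    whose target velocity is (action - noise).
    """
    x_t = [(1.0 - t) * e + t * a for e, a in zip(noise, action)]
    target = [a - e for e, a in zip(noise, action)]
    pred = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(action)


def surrogate_ratio(new_fn, old_fn, action, draws):
    """Assumed likelihood-ratio surrogate: exp(mean old loss - mean new loss).

    The same (noise, t) draws are used for both policies, so the ratio is
    exactly 1 when the new policy equals the old one.
    """
    old = sum(cfm_loss(old_fn, action, e, t) for e, t in draws) / len(draws)
    new = sum(cfm_loss(new_fn, action, e, t) for e, t in draws) / len(draws)
    return math.exp(old - new)


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO-clip surrogate applied to the ratio above (to be maximized)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

Under these assumptions, a policy whose flow matching loss on a high-advantage action drops relative to the old policy gets a ratio above 1, and the clip keeps any single update from moving too far, which is how the objective stays free of explicit likelihood computation.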

Machine Learning, Robotics