OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, et al.

arXiv
August 12, 2025

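Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) that can automate diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and more than 200 applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning, yielding robust performance gains that persist as the data scales. Our end-to-end agent models show strong performance across CUA benchmarks; in particular, OpenCUA-32B achieves an average success rate of 34.8%.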

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose O...