活水快报 - 42Digest


Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson

arXiv

July 28, 2025

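Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark, a curated set of high-impact attacks, and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today's AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.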


Artificial Intelligence, Computation and Language, Computers and Society
