Survey of Vision-Language-Action Models for Embodied Manipulation
Haoran Li, Yuhui Chen, Wenbo Cui, Weiheng Liu, Kai Liu, Mingcai Zhou, Zhengtao Zhang, Dongbin Zhao
Embodied intelligence systems, which enhance agent capabilities through continuous environment interactions, have garnered significant attention from both academia and industry. Vision-Language-Action (VLA) models, inspired by advancements in large foundation models, serve as universal robotic control frameworks that substantially improve agent-environment interaction capabilities in embodied intelligence systems and broaden the application scenarios of embodied AI robots. This survey comprehensively reviews VLA models for embodied manipulation. Firstly, it chronicles the developmental trajectory of VLA architectures. Subsequently, we conduct a detailed analysis of current research across five key dimensions: VLA model structures, training datasets, pre-training methods, post-training methods, and model evaluation. Finally, we synthesize key challenges in VLA development and real-world deployment, and outline promising directions for future research.