Back to the Features: DINO as a Foundation for Video World Models
Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski
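We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates a strong understanding of intuitive physics. Furthermore, we show that the predictor can be fine-tuned on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.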
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, an...
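To make the idea of planning by simulating candidate trajectories in latent space concrete, the following is a minimal sketch, not the method used in this work. It assumes a simple random-shooting planner, which the abstract does not specify, and replaces the action-conditioned predictor with a hypothetical toy function (`toy_predictor`) that adds the action to a 2-D latent vector. All names, dimensions, and the distance-based cost are illustrative assumptions.

```python
import math
import random


def toy_predictor(latent, action):
    """Hypothetical stand-in for an action-conditioned latent predictor.

    A real predictor would map DINOv2 patch features and an action to the
    next-frame features; here the latent is a short vector and the action
    is simply added to it.
    """
    return [z + a for z, a in zip(latent, action)]


def rollout(predictor, start, actions):
    """Apply the predictor step by step and return the final latent."""
    latent = list(start)
    for action in actions:
        latent = predictor(latent, action)
    return latent


def plan_by_random_shooting(predictor, start, goal, horizon=5,
                            num_candidates=256, seed=0):
    """Sample random action sequences, simulate each one in latent space,
    and keep the sequence whose final latent is closest to the goal."""
    rng = random.Random(seed)
    dim = len(start)  # assumption: actions share the latent's dimension
    best_actions, best_cost = None, math.inf
    for _ in range(num_candidates):
        actions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                   for _ in range(horizon)]
        final = rollout(predictor, start, actions)
        cost = math.dist(final, goal)  # Euclidean distance to the goal latent
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start = [0.0, 0.0]
goal = [2.0, -1.0]
best_actions, best_cost = plan_by_random_shooting(toy_predictor, start, goal)
print(f"final distance to goal: {best_cost:.3f}")
```

In practice the goal latent would presumably be obtained by encoding a target observation, and the sampling step could be replaced by a stronger optimizer; both choices are left open here.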