Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

1School of AI, Shanghai Jiao Tong University, 2EvoMind Tech, 3IAAR-Shanghai, 4SII, 5Carnegie Mellon University, 6University of Cambridge, 7Nanyang Technological University

Corresponding Author

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive numbers of parameters and rely heavily on large-scale robot-data pretraining, leading to high computational costs during training as well as limited deployability for real-time inference. Moreover, prevailing training paradigms often degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language Model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the MetaWorld and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release the code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
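As a rough illustration of the design described above, the sketch below shows one way a compact VLA policy could condition a diffusion-transformer action head on vision-language features through per-block modulation. It is a minimal sketch under assumptions: the module names (CrossModulatedDiTBlock, Evo1PolicySketch), dimensions, and the AdaLN-style scale/shift conditioning are illustrative and are not taken from the released Evo-1 implementation.

# Minimal sketch (not the authors' implementation) of a compact VLA action head:
# a diffusion transformer whose blocks are modulated by vision-language features.
# Module names, sizes, and the AdaLN-style conditioning are assumptions.
import torch
import torch.nn as nn


class CrossModulatedDiTBlock(nn.Module):
    """Transformer block whose normalization is scaled/shifted by a conditioning vector."""

    def __init__(self, dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.to_mod = nn.Linear(cond_dim, 4 * dim)  # per-block scale/shift pairs

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale1, shift1, scale2, shift2 = self.to_mod(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + self.mlp(h)


class Evo1PolicySketch(nn.Module):
    """Hypothetical action head: denoises an action chunk conditioned on VLM features."""

    def __init__(self, vlm_dim: int = 768, act_dim: int = 7, dim: int = 512, depth: int = 6):
        super().__init__()
        self.act_in = nn.Linear(act_dim, dim)
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cond_proj = nn.Linear(vlm_dim, dim)  # stand-in for the integration module
        self.blocks = nn.ModuleList(CrossModulatedDiTBlock(dim, dim) for _ in range(depth))
        self.act_out = nn.Linear(dim, act_dim)

    def forward(self, noisy_actions, t, vlm_features):
        # noisy_actions: (B, horizon, act_dim); t: (B, 1) diffusion step; vlm_features: (B, vlm_dim)
        cond = self.cond_proj(vlm_features) + self.time_emb(t)
        x = self.act_in(noisy_actions)
        for blk in self.blocks:
            x = blk(x, cond)
        return self.act_out(x)  # predicted noise (or denoised action chunk)


# Shape check: a batch of 2, action horizon 16, 7-DoF actions.
# policy = Evo1PolicySketch()
# out = policy(torch.randn(2, 16, 7), torch.rand(2, 1), torch.randn(2, 768))

In this sketch the VLM features enter every block through the modulation pathway rather than as extra tokens, which keeps the action head small; the actual cross-modulation and integration modules used by Evo-1 are described in the paper.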

Architecture of Evo-1


Simulation Experiments


Real-world Experiments

  1. Pick and Place Can.
    This task requires grasping a beverage can from varying initial positions and placing it into a white box on the table.
  2. Pour Foam from Cup.
    This task requires lifting a foam-filled cup from varying initial positions and rotating it to pour the foam into a white box.
  3. Hand Delivery.
    This task requires grasping a beverage can from varying positions and gently placing it into a human hand held at different locations.
  4. Can Stacking.
    This task requires grasping a beverage can and stacking it onto another with sufficient stability. The two cans are identical and randomly placed on the table.

Qualitative results of our model in real-world tasks


Videos of SO101 Tasks

Training Configuration

Demonstrations: 100

Training GPU: 1 × A100

Inference GPU: NVIDIA RTX 4060

Stage 1

  • Batch Size: 16
  • Training Steps: 5k

Stage 2

  • Batch Size: 16
  • Training Steps: 50k
  • Epochs: 20
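
For reference, the two-stage schedule listed above (the ALOHA folding configuration further below follows the same layout) can be captured in a small config object. This is a minimal sketch: the dataclass structure and field names are illustrative assumptions, not the released configuration format.

# Minimal sketch of the schedule above as a config object; the dataclass layout
# and field names are illustrative assumptions, not the released config format.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageConfig:
    batch_size: int
    training_steps: int
    epochs: Optional[int] = None  # only reported for stage 2 above


@dataclass
class TrainConfig:
    demonstrations: int
    train_gpus: int
    stage1: StageConfig  # first training stage
    stage2: StageConfig  # second, longer training stage


# SO101 configuration from the list above (1 x A100 for training).
so101_config = TrainConfig(
    demonstrations=100,
    train_gpus=1,
    stage1=StageConfig(batch_size=16, training_steps=5_000),
    stage2=StageConfig(batch_size=16, training_steps=50_000, epochs=20),
)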

Videos of ALOHA Folding Task

Training Configuration

Demonstrations: 50

Training GPU: 4 × A100

Stage 1

  • Batch Size: 16
  • Training Steps: 10k

Stage 2

  • Batch Size: 16
  • Training Steps: 87.5k
  • Epochs: 70

This task was contributed by community user meijie-jesse. We greatly appreciate their contribution.

BibTeX

@article{lin2025evo,
  title={Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment},
  author={Lin, Tao and Zhong, Yilei and Du, Yuxin and Zhang, Jingjing and Liu, Jiting and Chen, Yinxinyu and Gu, Encheng and Liu, Ziyan and Cai, Hongyi and Zou, Yanwen and others},
  journal={arXiv preprint arXiv:2511.04555},
  year={2025}
}