moka_pot_libero_sft

A π₀ (pi0) RECAP Vision-Language-Action (VLA) model, finetuned on the LIBERO robotic manipulation benchmark using the OpenTau training framework. The model follows natural language instructions to perform manipulation tasks in a simulated tabletop environment, achieving an ~83% success rate measured over 212 evaluation episodes.

For full documentation, evaluation results, and inference code, please visit the repository:
👉 https://github.com/TensorAuto/OpenTau


Model Details

Description

  • Model Type: Vision-Language-Action (VLA) Model
  • Base Architecture: π₀ (pi0) by Physical Intelligence
  • Backbone: PaliGemma-3B (VLM) + Gemma-300M (Action Expert) + advantage-indicator conditioning (RECAP)
  • Training Data: Moka Pot Task on LIBERO (Lifelong Robot Learning) Benchmark
  • Framework: OpenTau

Architecture

The π₀ RECAP architecture combines a flow-matching policy with reinforcement learning and is designed for open-world generalization. A Vision-Language Model (VLM) provides high-level semantic understanding, while a smaller "action expert" model generates continuous joint trajectories (10-step action chunks) via flow matching. RL-style advantage conditioning lets the policy learn from both good and bad episodes.
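The flow-matching step above can be sketched as follows. This is a minimal illustration, not the OpenTau implementation: the `velocity_field` is a toy stand-in for the action expert, and the context vector stands in for VLM features; only the chunk shape (10 steps) comes from the card.

```python
import numpy as np

CHUNK_LEN = 10   # actions per chunk (from the card)
ACTION_DIM = 7   # assumed 7-DoF action space (illustrative)

def velocity_field(actions, t, context):
    """Stand-in for the action expert: predicts d(actions)/dt.
    The real model conditions on VLM features; here a toy linear pull
    toward a context-derived target is used purely for illustration."""
    target = np.tanh(context)[: actions.size].reshape(actions.shape)
    return target - actions

def sample_action_chunk(context, steps=10, rng=None):
    """Flow-matching sampling: integrate the velocity field from noise
    (t=0) to an action chunk (t=1) with simple Euler steps."""
    rng = rng or np.random.default_rng(0)
    actions = rng.standard_normal((CHUNK_LEN, ACTION_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        actions = actions + dt * velocity_field(actions, t, context)
    return actions

context = np.linspace(-1, 1, CHUNK_LEN * ACTION_DIM)  # stand-in VLM embedding
chunk = sample_action_chunk(context)
print(chunk.shape)  # (10, 7)
```

At inference time the robot executes the chunk (or a prefix of it) before re-querying the policy, which amortizes the cost of the VLM forward pass over several control steps.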


Training and Evaluation

The advantage indicator (Iₜ) was set to True for all datapoints, since this SFT run trained only on expert demonstrations.
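In RECAP-style advantage conditioning, the indicator is an extra policy input that separates good episodes from bad ones; under pure SFT it is constant. A minimal sketch of how a training example might carry it (field names are assumptions, not OpenTau's schema):

```python
def make_training_example(obs, action_chunk, advantage_indicator=True):
    """Attach the advantage indicator to one training example (sketch).
    With only expert demonstrations, every example gets indicator=True,
    so the policy is always conditioned on the 'good episode' signal."""
    return {
        "observation": obs,
        "actions": action_chunk,
        "advantage": 1.0 if advantage_indicator else 0.0,
    }

example = make_training_example({"image": None, "instruction": "..."}, [0.0] * 10)
print(example["advantage"])  # 1.0
```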

Dataset

This model was finetuned on the Moka Pot task from the LIBERO-10 benchmark, using approximately 29 expert teleoperated episodes.
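To train a model that predicts 10-step action chunks, each episode's action sequence is typically sliced into one fixed-length chunk per timestep. The sketch below shows one common convention (repeat-pad the final action); the exact chunking and padding OpenTau uses may differ.

```python
import numpy as np

CHUNK_LEN = 10  # matches the model's 10-step action chunks

def chunk_episode(actions, chunk_len=CHUNK_LEN):
    """Split one episode's (T, action_dim) action sequence into training
    targets: one chunk starting at every timestep, padded at the end by
    repeating the final action (an assumed, common convention)."""
    T, _ = actions.shape
    padded = np.concatenate([actions, np.repeat(actions[-1:], chunk_len, axis=0)])
    return np.stack([padded[t : t + chunk_len] for t in range(T)])

# A toy 25-step episode with 7-dim actions yields 25 chunks of shape (10, 7).
chunks = chunk_episode(np.zeros((25, 7)))
print(chunks.shape)  # (25, 10, 7)
```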

Results

For detailed usage instructions, success rates, baseline comparisons, and evaluation protocols, refer to the OpenTau GitHub repository. The model achieves an ~83% success rate measured over 212 evaluation episodes.

