FlowerVLA - Vision-Language-Action Flow Model for CALVIN D

This is a pretrained FlowerVLA model for robotic manipulation, trained on the CALVIN D dataset. Flower is an efficient Vision-Language-Action flow policy for robot learning with only about 1B parameters.

Model Description

FlowerVLA is a novel architecture that:

Uses half of Florence-2 for multi-modal vision-language encoding
Employs a novel transformer-based flow matching architecture
Provides an efficient, versatile VLA policy with only ~1B parameters
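
To give a feel for how a flow matching policy produces actions, here is a minimal, generic sketch of sampling by Euler integration of a velocity field from noise (t=0) to an action (t=1). It is not FlowerVLA's implementation: `sample_actions`, `toy_velocity`, `TARGET`, the time convention, and the step count are all assumptions made for illustration, and the toy velocity field stands in for the learned transformer.

```python
import random

def sample_actions(velocity_fn, action_dim=7, num_steps=4, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, one value per action dimension.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for the learned network: the exact velocity field that
# carries any point to a single fixed target along a straight-line path.
TARGET = [0.1, -0.2, 0.0, 0.0, 0.0, 0.3, 1.0]

def toy_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]

action = sample_actions(toy_velocity)
print([round(a, 3) for a in action])
```

In a real policy the velocity function would also be conditioned on the encoded images and the language instruction.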

Model Performance

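This checkpoint contains weights for the CALVIN D challenge and currently ranks first, with the following results. Columns 1 to 5 give the success rate for completing that many instructions in a row; Avg. Len. is the average number of consecutively completed instructions: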

Train→Test	Method	1	2	3	4	5	Avg. Len.
D→D	FlowerVLA	98.4%	94.0%	87.9%	81.7%	74.1%	4.36
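
The Avg. Len. value can be cross-checked against the five success rates. If column k is the fraction of evaluation sequences in which at least k instructions were completed in a row, then the expected number of consecutively completed instructions is the sum of the five columns. A minimal sketch, where the function name `average_sequence_length` is chosen here for illustration:

```python
# Success rates for completing 1, 2, 3, 4, and 5 instructions in a row,
# copied from the table above.
success_rates = [0.984, 0.940, 0.879, 0.817, 0.741]

def average_sequence_length(rates):
    """Expected chain length: sum over k of P(at least k tasks completed)."""
    return sum(rates)

avg_len = average_sequence_length(success_rates)
print(f"{avg_len:.2f}")  # prints 4.36
```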

Input/Output Specifications

Inputs

RGB Static Camera: (B, T, 3, H, W) tensor
RGB Gripper Camera: (B, T, 3, H, W) tensor
Language Instructions: Text strings

Outputs

Action Space: (B, T, 7) tensor representing delta EEF actions
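
The shapes above can be checked before calling the model. The helper below is a hypothetical sketch, not part of the FlowerVLA code base: `check_shapes` is a name introduced here, and it only assumes that each input exposes a `.shape` attribute, as PyTorch tensors and NumPy arrays do. It does not require the two cameras to share H and W, and it does not assume that the action horizon equals the observation length.

```python
from types import SimpleNamespace

def check_shapes(rgb_static, rgb_gripper, actions=None):
    """Validate (B, T, 3, H, W) images and, optionally, (B, T, 7) actions."""
    for name, img in (("rgb_static", rgb_static), ("rgb_gripper", rgb_gripper)):
        shape = tuple(img.shape)
        if len(shape) != 5 or shape[2] != 3:
            raise ValueError(f"{name}: expected (B, T, 3, H, W), got {shape}")
    batch, steps = tuple(rgb_static.shape)[:2]
    if tuple(rgb_gripper.shape)[:2] != (batch, steps):
        raise ValueError("rgb_static and rgb_gripper disagree on (B, T)")
    if actions is not None:
        a_shape = tuple(actions.shape)
        # Each action holds 7 delta end-effector values.
        if len(a_shape) != 3 or a_shape[0] != batch or a_shape[2] != 7:
            raise ValueError(f"actions: expected (B, T, 7), got {a_shape}")
    return batch, steps

# Stand-ins with only a .shape attribute; the image sizes are arbitrary examples.
static = SimpleNamespace(shape=(2, 1, 3, 224, 224))
gripper = SimpleNamespace(shape=(2, 1, 3, 112, 112))
print(check_shapes(static, gripper))  # prints (2, 1)
```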

Usage

Check out our full model implementation on GitHub (link: todo) and follow the instructions in the README to test the model in one of the environments. A single inference step looks like this:

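# static_image and gripper_image are RGB tensors of shape (B, T, 3, H, W);
# model is a FlowerVLA policy loaded from this checkpoint.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # static camera
        "rgb_gripper": gripper_image,  # gripper camera
    }
}
goal = {"lang_text": "pick up the blue cube"}  # language instruction
action = model.step(obs, goal)  # delta EEF action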