Pick an open-source Vision–Language–Action (VLA) model to fine-tune on your data. All models share the same training and inference contract.
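To make the shared-contract claim concrete, here is a minimal sketch of what a uniform interface can look like, assuming two entry points per model: fine-tune on a dataset and predict an action chunk from an observation. The class, field, and method names below are illustrative assumptions, not Positronic's actual API.

```python
# Hypothetical sketch of a uniform train/infer contract (illustrative only;
# names and signatures are assumptions, not the actual Positronic interface).
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Observation:
    images: dict              # camera name -> image array
    state: Sequence[float]    # proprioceptive state (joint positions, gripper, ...)
    instruction: str          # natural-language task description


class VLAPolicy(ABC):
    """Assumed shape of the contract every model on this page exposes."""

    @abstractmethod
    def finetune(self, dataset_dir: str, output_dir: str) -> None:
        """Fine-tune on a directory of demonstrations and write a checkpoint."""

    @abstractmethod
    def act(self, obs: Observation) -> Sequence[Sequence[float]]:
        """Return a chunk of future actions for the current observation."""
```

Because every model sits behind the same two calls, switching models means pointing the same data and the same inference loop at a different implementation.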
Generalist VLA. SigLIP vision tower + Gemma-2B LLM + a dedicated 300M-parameter action expert. Strong on pick-and-place and household manipulation.
Foundation VLA from NVIDIA. Eagle3 vision-language backbone + diffusion-based action head. Multi-embodiment — works on arms, humanoids and mobile bases.
Efficient open VLA from the LeRobot team. Best when iteration speed and edge deployment matter more than peak quality.
Action Chunking Transformer. The classic single-task imitation-learning baseline: small, fast, predictable. A sketch of the action-chunking idea follows this list.
Generative world-model policy. Predicts video and actions per chunk — strong on multi-task, long-horizon behaviour. Heavier to fine-tune.
Bring a Docker container that implements the Positronic training and inference command lines. Push it to your registry, point Positronic at the image, and your model shows up here.
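The actual Positronic command lines are defined by the framework and are not reproduced here. As a hedged sketch of the general shape, a container entrypoint that dispatches hypothetical `train` and `serve` subcommands could look like this; every flag and subcommand name below is an assumption for illustration only.

```python
#!/usr/bin/env python3
# Hypothetical container entrypoint (illustrative only; the real Positronic
# command lines and flag names may differ).
import argparse


def train(data_dir: str, out_dir: str) -> None:
    # Load demonstrations from data_dir, fine-tune the model, and write
    # checkpoints to out_dir. Placeholder for your training code.
    print(f"training on {data_dir}, writing checkpoints to {out_dir}")


def serve(checkpoint: str, port: int) -> None:
    # Load the checkpoint and answer inference requests on the given port.
    # Placeholder for your inference server.
    print(f"serving {checkpoint} on port {port}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Example VLA container entrypoint")
    sub = parser.add_subparsers(dest="command", required=True)

    p_train = sub.add_parser("train")
    p_train.add_argument("--data-dir", required=True)
    p_train.add_argument("--out-dir", required=True)

    p_serve = sub.add_parser("serve")
    p_serve.add_argument("--checkpoint", required=True)
    p_serve.add_argument("--port", type=int, default=8000)

    args = parser.parse_args()
    if args.command == "train":
        train(args.data_dir, args.out_dir)
    else:
        serve(args.checkpoint, args.port)


if __name__ == "__main__":
    main()
```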
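For the Action Chunking Transformer entry above, the defining idea is that the policy predicts a chunk of the next few actions per call rather than a single action, and overlapping chunks can be blended with the temporal ensembling described in the ACT paper. The sketch below uses a random placeholder policy and made-up dimensions; only the chunking and weighting logic is the point.

```python
# Minimal sketch of action chunking with ACT-style temporal ensembling.
# The policy and its dimensions are placeholders, not a trained model.
import numpy as np

CHUNK = 8      # actions predicted per policy call
ACT_DIM = 7    # e.g. 6 joint deltas + gripper
STEPS = 20
M = 0.1        # ensembling coefficient: smaller = reacts faster to new chunks


def policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a trained ACT model: returns CHUNK future actions."""
    return np.random.randn(CHUNK, ACT_DIM)


# buffer[t] collects every prediction made for timestep t by past policy calls
buffer = [[] for _ in range(STEPS + CHUNK)]
obs = np.zeros(10)

for t in range(STEPS):
    chunk = policy(obs)                  # query the policy every step
    for i in range(CHUNK):               # file each predicted action under its timestep
        buffer[t + i].append(chunk[i])

    preds = np.stack(buffer[t])           # all predictions that cover timestep t
    w = np.exp(-M * np.arange(len(preds)))  # oldest prediction gets the largest weight
    action = (w[:, None] * preds).sum(axis=0) / w.sum()
    # send `action` to the robot here; obs would then be refreshed from sensors
```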