OpenVLA: Open-Source Vision-Language-Action Model¶
OpenVLA is the first fully open-source vision-language-action (VLA) model to achieve strong performance across multiple robot embodiments.
Paper: "OpenVLA: An Open-Source Vision-Language-Action Model" (Kim et al., 2024)
GitHub: https://github.com/openvla/openvla
Overview¶
OpenVLA addresses critical limitations of closed-source models (RT-2, PaLM-E):
- ✓ Fully open-source: Weights, code, and training data
- ✓ Strong performance: Competitive with RT-2
- ✓ Multi-robot: Works across different embodiments
- ✓ Efficient: 7B parameters (vs 562B for PaLM-E)
- ✓ Reproducible: Complete training pipeline available
Architecture¶
OpenVLA builds on Prismatic VLMs with robotics-specific adaptations:
```mermaid
graph TD
    A[Image 224x224] --> B[DINOv2 ViT]
    C[SigLIP ViT] --> D[Vision Fusion]
    B --> D
    D --> E[Llama-2 7B<br/>Language Backbone]
    F[Instruction] --> E
    E --> G[Action Tokens<br/>Prediction]
    G --> H[7-DoF Actions]
```
Key Components¶
1. Dual Vision Encoders¶
OpenVLA uses two complementary vision encoders:
```python
import torch
import torch.nn as nn
import open_clip

class DualVisionEncoder(nn.Module):
    """Fuse DINOv2 and SigLIP for robust vision encoding."""

    def __init__(self):
        super().__init__()
        # DINOv2: self-supervised, strong spatial features
        self.dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
        # SigLIP: language-aligned vision features
        self.siglip = open_clip.create_model('ViT-SO400M-14-SigLIP')
        # Projections to a common dimension
        self.proj_dino = nn.Linear(1024, 1024)    # DINOv2 ViT-L
        self.proj_siglip = nn.Linear(1152, 1024)  # SigLIP SO400M
        # Fusion mechanism (custom cross-attention module)
        self.fusion = CrossAttention(dim=1024, num_heads=16)

    def forward(self, images):
        # Extract features from both (frozen) encoders
        with torch.no_grad():
            dino_features = self.dino(images)      # (B, 257, 1024)
            siglip_features = self.siglip(images)  # (B, 257, 1152)
        # Project to the shared dimension
        dino_proj = self.proj_dino(dino_features)
        siglip_proj = self.proj_siglip(siglip_features)
        # Fuse with cross-attention
        return self.fusion(dino_proj, siglip_proj)
```
Why Two Encoders?
- DINOv2: Excellent spatial understanding, object segmentation
- SigLIP: Language-aligned, better for instruction following
- Fusion: Combines strengths of both
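The `CrossAttention` module used above is not defined in this snippet. A minimal sketch using PyTorch's built-in `nn.MultiheadAttention` is shown below; treat it as illustrative (the released OpenVLA fuses the two encoders more simply, by concatenating their patch features, so cross-attention fusion is one possible variant rather than the reference implementation):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention fusion: one encoder's tokens attend
    to the other's. Illustrative sketch, not the released fusion."""

    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # query_tokens:   (B, N, dim), e.g. projected DINOv2 features
        # context_tokens: (B, M, dim), e.g. projected SigLIP features
        attended, _ = self.attn(query_tokens, context_tokens, context_tokens)
        # Residual + norm keeps the query stream's spatial structure
        return self.norm(query_tokens + attended)

fusion = CrossAttention(dim=1024, num_heads=16)
dino = torch.randn(2, 257, 1024)
siglip = torch.randn(2, 257, 1024)
fused = fusion(dino, siglip)
print(fused.shape)  # torch.Size([2, 257, 1024])
```

The residual connection means the fused tokens keep the spatial layout of the query stream, with language-aligned context mixed in.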
2. Llama-2 Backbone¶
Uses Llama-2 7B as the language model backbone:
```python
import torch
import torch.nn as nn
from transformers import LlamaModel, LlamaTokenizer

class OpenVLABackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Load Llama-2 7B
        self.llama = LlamaModel.from_pretrained('meta-llama/Llama-2-7b-hf')
        self.tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
        # Vision-language adapter
        self.vision_adapter = nn.Sequential(
            nn.Linear(1024, 4096),  # vision dim -> Llama hidden dim
            nn.LayerNorm(4096),
            nn.GELU()
        )
        # Action prediction head
        self.action_head = nn.Linear(4096, 7 * 256)  # 7 dims, 256 bins each

    def forward(self, vision_features, instruction):
        # Adapt vision features to the Llama hidden dimension
        vision_tokens = self.vision_adapter(vision_features)
        # Tokenize the instruction (move ids onto the model's device)
        text_tokens = self.tokenizer(
            instruction,
            return_tensors='pt',
            padding=True
        ).input_ids.to(vision_tokens.device)
        # Embed the text tokens
        text_embeds = self.llama.get_input_embeddings()(text_tokens)
        # Concatenate vision and text tokens
        combined = torch.cat([vision_tokens, text_embeds], dim=1)
        # Forward through Llama
        outputs = self.llama(inputs_embeds=combined)
        # Predict actions from the last token's hidden state
        action_logits = self.action_head(outputs.last_hidden_state[:, -1, :])
        return action_logits.view(-1, 7, 256)
```
Action Representation¶
OpenVLA discretizes continuous actions into tokens, following RT-2:
```python
import torch

class ActionTokenizer:
    """Discretize continuous actions into tokens."""

    def __init__(self, action_dim=7, bins=256, action_ranges=None):
        self.action_dim = action_dim
        self.bins = bins
        # Default ranges: [-1, 1] for pose dims, [0, 1] for gripper
        if action_ranges is None:
            self.action_ranges = [(-1, 1)] * 6 + [(0, 1)]
        else:
            self.action_ranges = action_ranges

    def encode(self, actions):
        """Continuous actions -> discrete tokens."""
        tokens = []
        for i in range(self.action_dim):
            # Normalize to [0, 1]
            low, high = self.action_ranges[i]
            normalized = (actions[:, i] - low) / (high - low)
            # Discretize and clamp to valid bin indices
            discrete = (normalized * (self.bins - 1)).long()
            discrete = torch.clamp(discrete, 0, self.bins - 1)
            tokens.append(discrete)
        return torch.stack(tokens, dim=1)

    def decode(self, tokens):
        """Discrete tokens -> continuous actions."""
        actions = []
        for i in range(self.action_dim):
            # Back to [0, 1]
            normalized = tokens[:, i].float() / (self.bins - 1)
            # Rescale to the original range
            low, high = self.action_ranges[i]
            continuous = normalized * (high - low) + low
            actions.append(continuous)
        return torch.stack(actions, dim=1)
```
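The precision cost of 256-bin discretization is easy to check numerically. Here is a standalone sketch of the same encode/decode arithmetic for a single scalar dimension (plain Python, mirroring the tokenizer's truncating `.long()` cast):

```python
def encode(value, low=-1.0, high=1.0, bins=256):
    """Continuous value -> bin index, mirroring ActionTokenizer.encode."""
    normalized = (value - low) / (high - low)
    return min(max(int(normalized * (bins - 1)), 0), bins - 1)

def decode(token, low=-1.0, high=1.0, bins=256):
    """Bin index -> continuous value, mirroring ActionTokenizer.decode."""
    return token / (bins - 1) * (high - low) + low

token = encode(0.5)    # 0.5 normalizes to 0.75 -> floor(0.75 * 255) = 191
value = decode(token)  # 191 / 255 * 2 - 1 ≈ 0.498
print(token, round(value, 3))

# Because encoding floors rather than rounds, the worst-case
# round-trip error is one bin width: (high - low) / (bins - 1) ≈ 0.0078
assert abs(value - 0.5) < 2 / 255
```

For a gripper on [0, 1] the bin width halves to about 0.004, which is well below typical actuation noise.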
Training Pipeline¶
Dataset: Open-X Embodiment¶
OpenVLA is trained on the Open-X Embodiment dataset:
- Size: 1M+ trajectories across 22 robot embodiments
- Tasks: 150+ task families
- Diversity: Lab environments, kitchens, warehouses, outdoor
```python
from openvla.data import OpenXEmbodimentDataset

# Load dataset
dataset = OpenXEmbodimentDataset(
    data_dir='/path/to/openx',
    robots=['franka', 'google_robot', 'taco_play'],  # select embodiments
    tasks=['pick', 'place', 'drawer', 'wipe'],       # select task families
    image_size=224
)

# Each sample is a dict:
sample = dataset[0]
# {
#     'image': (3, 224, 224),
#     'instruction': "pick up the blue block",
#     'action': (7,),
#     'robot_id': 'franka'
# }
```
Training Configuration¶
```yaml
model:
  vision_encoder:
    dino: 'dinov2_vitl14'
    siglip: 'ViT-SO400M-14-SigLIP'
  language_backbone: 'meta-llama/Llama-2-7b-hf'
  action_dim: 7
  action_bins: 256

training:
  batch_size: 256
  gradient_accumulation: 4
  learning_rate: 1e-4
  warmup_steps: 2000
  max_steps: 100000
  weight_decay: 0.01
  # LoRA for efficient fine-tuning
  use_lora: true
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05

optimization:
  optimizer: 'adamw'
  beta1: 0.9
  beta2: 0.95
  epsilon: 1e-8
  gradient_clip: 1.0
  # Mixed precision
  fp16: false
  bf16: true  # wider dynamic range, better for large models

data:
  num_workers: 16
  prefetch_factor: 4
  pin_memory: true
  # Data augmentation
  augmentation:
    random_crop: true
    color_jitter: true
    random_flip: false  # don't flip: left/right matters for robot actions
```
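The batch-size and accumulation settings above combine into the effective batch the optimizer actually sees. Assuming `max_steps` counts optimizer updates, the totals work out as:

```python
# Values from the training configuration above
batch_size = 256
gradient_accumulation = 4
max_steps = 100_000

# Samples contributing to each optimizer update
effective_batch = batch_size * gradient_accumulation
print(effective_batch)              # 1024

# Total samples seen over the full run
print(effective_batch * max_steps)  # 102400000
```

That is roughly 100M samples, i.e. each of the ~1M trajectories' frames is revisited many times over training.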
Training Script¶
```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model
from openvla import OpenVLAModel
from openvla.data import OpenXEmbodimentDataset

def train_openvla(config):
    # Initialize model
    model = OpenVLAModel(config)

    # Apply LoRA for parameter-efficient training
    if config.use_lora:
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            lora_dropout=config.lora_dropout,
            bias="none"
        )
        model = get_peft_model(model, lora_config)

    # Multi-GPU
    model = torch.nn.DataParallel(model)
    model = model.cuda()

    # Optimizer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        betas=(config.beta1, config.beta2),
        eps=config.epsilon,
        weight_decay=config.weight_decay
    )

    # Learning rate scheduler with warmup
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config.warmup_steps,
        num_training_steps=config.max_steps
    )

    # Load dataset
    dataset = OpenXEmbodimentDataset(config.data_dir)
    dataloader = DataLoader(
        dataset,
        batch_size=config.batch_size,
        num_workers=config.num_workers,
        shuffle=True,
        pin_memory=True
    )

    # Training loop
    model.train()
    global_step = 0
    for epoch in range(config.num_epochs):
        for batch in dataloader:
            # Forward
            logits = model(
                images=batch['image'].cuda(),
                instructions=batch['instruction']
            )

            # Loss: cross-entropy over discretized action tokens
            # (DataParallel wraps the model, so reach through .module)
            targets = model.module.action_tokenizer.encode(batch['action'].cuda())
            loss = F.cross_entropy(
                logits.view(-1, config.action_bins),
                targets.view(-1)
            )

            # Backward
            loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.gradient_clip)

            # Update
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            global_step += 1

            # Logging
            if global_step % 100 == 0:
                print(f"Step {global_step}: loss = {loss.item():.4f}, "
                      f"lr = {scheduler.get_last_lr()[0]:.2e}")

            # Checkpointing
            if global_step % 5000 == 0:
                torch.save({
                    'step': global_step,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                }, f'checkpoint_step{global_step}.pt')
```
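One caveat: the config sets `gradient_accumulation: 4`, but the loop above steps the optimizer on every batch. The standard accumulation pattern defers the update and scales the loss so gradients average over micro-batches. A minimal, runnable sketch with a toy model (the `nn.Linear` stand-in and step counts are illustrative):

```python
import torch
import torch.nn as nn

accum_steps = 4
model = nn.Linear(8, 1)  # toy stand-in for the VLA
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

updates = 0
for batch_idx in range(12):
    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean()
    # Scale so the accumulated gradient is the mean over micro-batches
    (loss / accum_steps).backward()
    # Only step (and clip) once every accum_steps micro-batches
    if (batch_idx + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
        updates += 1

print(updates)  # 3 optimizer updates for 12 micro-batches
```

Clipping must happen on the accumulated gradient, just before `optimizer.step()`, not after every backward pass.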
Fine-tuning for Your Robot¶
OpenVLA excels at few-shot adaptation to new robots:
```python
from openvla import OpenVLAModel

# Load pre-trained OpenVLA
model = OpenVLAModel.from_pretrained('openvla/openvla-7b')

# Freeze the base model; train only the adapter and action head
for name, param in model.named_parameters():
    if 'adapter' not in name and 'action_head' not in name:
        param.requires_grad = False

# Fine-tune on your robot data (as few as 50-100 demos)
your_dataset = YourRobotDataset(num_demos=100)
finetune(model, your_dataset, num_epochs=10)
```
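The freezing pattern above is easy to verify on a toy module. The stand-in layers and their sizes below are hypothetical; only the name-based filter mirrors the snippet above:

```python
import torch.nn as nn

# Toy stand-ins: names chosen to match the freezing filter
model = nn.ModuleDict({
    'backbone': nn.Linear(4096, 4096),       # stand-in for Llama-2 layers
    'vision_adapter': nn.Linear(1024, 4096),
    'action_head': nn.Linear(4096, 7 * 256),
})

# Same filter as the fine-tuning snippet
for name, param in model.named_parameters():
    if 'adapter' not in name and 'action_head' not in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable}/{total} trainable")
```

In the real 7B model the frozen backbone dominates, so only a few percent of parameters receive gradients, which is what makes 50-100 demonstrations enough.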
Few-Shot Performance¶
| Demonstrations | Success Rate |
|---|---|
| 0 (zero-shot) | 45% |
| 10 | 62% |
| 50 | 78% |
| 100 | 85% |
| 500 | 92% |
Deployment¶
Inference Optimization¶
```python
import torch
from openvla import OpenVLAModel

# Load model
model = OpenVLAModel.from_pretrained('openvla/openvla-7b')
model.eval()
model.half()  # FP16 for speed
model.cuda()

# Compile for faster inference (PyTorch 2.0+)
model = torch.compile(model, mode='reduce-overhead')

@torch.no_grad()
def predict_action(model, image, instruction):
    # Preprocess (resize / normalize to the model's expected input)
    image_tensor = preprocess(image).cuda()
    # Forward pass
    action_logits = model(image_tensor, instruction)
    # Greedy decode: most likely bin per action dimension
    action_tokens = torch.argmax(action_logits, dim=-1)
    action = model.action_tokenizer.decode(action_tokens)
    return action.cpu().numpy()
```
Real-Time Performance¶
| Optimization | Latency | Throughput |
|---|---|---|
| Base (FP32) | 200ms | 5 Hz |
| FP16 | 120ms | 8 Hz |
| FP16 + Compile | 80ms | 12 Hz |
| INT8 Quantization | 50ms | 20 Hz |
| TensorRT | 30ms | 33 Hz |
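The throughput column is simply the reciprocal of latency (1000 ms divided by the per-call latency, truncated to whole hertz), which gives the maximum control rate a blocking inference loop can sustain:

```python
# Latencies from the table above (milliseconds per forward pass)
latencies_ms = {'FP32': 200, 'FP16': 120, 'FP16+compile': 80,
                'INT8': 50, 'TensorRT': 30}

for name, ms in latencies_ms.items():
    # Max control rate for a blocking loop: floor(1000 / latency)
    print(f"{name}: {1000 // ms} Hz")
```

Many manipulation controllers run at 5-10 Hz, so FP16 alone is often sufficient; the faster variants matter for reactive or dynamic tasks.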
Comparison with Other VLA Models¶
| Model | Parameters | Open Source | Multi-Robot | Performance |
|---|---|---|---|---|
| RT-1 | 35M | ✗ | ✗ | Good |
| RT-2 | 5B-562B | ✗ | ✗ | Excellent |
| OpenVLA | 7B | ✓ | ✓ | Very Good |
| Octo | 93M | ✓ | ✓ | Good |
Advanced: Multi-Embodiment Training¶
OpenVLA's key innovation is effective multi-robot training:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbodimentAdapter(nn.Module):
    """Adapt OpenVLA to a specific robot embodiment."""

    def __init__(self, num_embodiments=22, hidden_dim=4096, adapter_dim=256):
        super().__init__()
        # Learnable per-embodiment embeddings
        self.embodiment_embeddings = nn.Embedding(num_embodiments, adapter_dim)
        # Bottleneck adapter layers
        self.adapter_down = nn.Linear(hidden_dim, adapter_dim)
        self.adapter_up = nn.Linear(adapter_dim, hidden_dim)
        # Zero-init the up-projection so the adapter starts as the identity
        nn.init.zeros_(self.adapter_up.weight)
        nn.init.zeros_(self.adapter_up.bias)

    def forward(self, hidden_states, embodiment_id):
        # Look up the embodiment embedding: (B, adapter_dim)
        emb = self.embodiment_embeddings(embodiment_id)
        # Down-project and inject embodiment information
        down = F.relu(self.adapter_down(hidden_states))
        down = down + emb.unsqueeze(1)  # broadcast over the sequence
        up = self.adapter_up(down)
        # Residual connection
        return hidden_states + up
```
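Because the up-projection is zero-initialized, the adapter is exactly the identity at the start of training, so multi-embodiment fine-tuning begins from the pretrained policy's behavior. This property can be checked directly (restating the adapter minimally with small, illustrative dimensions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbodimentAdapter(nn.Module):
    """Small-scale restatement of the adapter above for testing."""

    def __init__(self, num_embodiments=22, hidden_dim=64, adapter_dim=16):
        super().__init__()
        self.embodiment_embeddings = nn.Embedding(num_embodiments, adapter_dim)
        self.adapter_down = nn.Linear(hidden_dim, adapter_dim)
        self.adapter_up = nn.Linear(adapter_dim, hidden_dim)
        # Zero-init makes the residual branch contribute nothing at init
        nn.init.zeros_(self.adapter_up.weight)
        nn.init.zeros_(self.adapter_up.bias)

    def forward(self, hidden_states, embodiment_id):
        emb = self.embodiment_embeddings(embodiment_id)
        down = F.relu(self.adapter_down(hidden_states))
        down = down + emb.unsqueeze(1)
        return hidden_states + self.adapter_up(down)

adapter = EmbodimentAdapter()
h = torch.randn(2, 10, 64)
out = adapter(h, torch.tensor([0, 5]))
print(torch.equal(out, h))  # True: zero-init up-projection -> identity
```

As training proceeds, gradients flow into `adapter_up`, and the embodiment embedding steers the correction applied for each robot.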
Key Takeaways¶
Advantages:
- ✓ Open-source and reproducible
- ✓ Strong cross-embodiment generalization
- ✓ Efficient (7B vs 562B for RT-2-PaLM-E)
- ✓ Easy to fine-tune for new robots

Limitations:
- ✗ Slower than RT-1 (requires optimization)
- ✗ Slightly lower performance than RT-2-PaLM-E
- ✗ Requires significant compute for training

Best For:
- Research and development
- Custom robot platforms
- When open-source is required
- Multi-robot deployments
Resources¶
- Code: https://github.com/openvla/openvla
- Models: https://huggingface.co/openvla
- Paper: Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", 2024
- Dataset: Open-X Embodiment
Next Steps¶
- Octo - Another generalist robot policy
- Fine-tuning Guide - Adapt OpenVLA to your robot
- Custom Architectures - Build your own VLA