RT-1 and RT-2: Robotics Transformers

Google DeepMind's Robotics Transformer models are landmark vision-language-action (VLA) systems for real-world robotic control.

RT-1: Robotics Transformer 1

Paper: "RT-1: Robotics Transformer for Real-World Control at Scale" (Brohan et al., 2022)

Architecture Overview

RT-1 combines efficient vision encoding with transformer-based action prediction:

graph LR
    A[Image 224x224x3] --> B[EfficientNet-B3]
    C[Language Instruction] --> D[USE Encoder]
    B --> E[TokenLearner]
    E --> F[Transformer<br/>8 tokens x 512d]
    D --> F
    F --> G[Action Head]
    G --> H[7-DoF Action]

Key Innovations

1. Token Learner for Efficiency

Instead of using all image patches, TokenLearner selects the most informative tokens:

import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Learns to select a small set of informative visual tokens"""
    def __init__(self, num_tokens=8, dim=512):
        super().__init__()
        self.num_tokens = num_tokens
        # One attention map per output token
        self.token_wts = nn.Conv2d(dim, num_tokens, kernel_size=1)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape

        # Compute one spatial attention map per token
        attn = self.token_wts(x.permute(0, 3, 1, 2))  # (B, num_tokens, H, W)

        # Normalize each map over spatial positions so every output
        # token is a weighted average of the feature map
        attn = attn.view(B, self.num_tokens, H * W).softmax(dim=-1)

        # Weighted aggregation of spatial features
        x_flat = x.reshape(B, H * W, C)
        tokens = torch.bmm(attn, x_flat)  # (B, num_tokens, C)

        return tokens

2. Action Discretization

RT-1 discretizes continuous actions into 256 bins per dimension:

import torch

def discretize_actions(actions, num_bins=256, action_bounds=(-1, 1)):
    """Convert continuous actions to discrete tokens"""
    # Normalize to [0, 1]
    actions_normalized = (actions - action_bounds[0]) / (action_bounds[1] - action_bounds[0])

    # Discretize
    actions_discrete = (actions_normalized * (num_bins - 1)).long()

    # Clamp to valid range
    actions_discrete = torch.clamp(actions_discrete, 0, num_bins - 1)

    return actions_discrete

def undiscretize_actions(actions_discrete, num_bins=256, action_bounds=(-1, 1)):
    """Convert discrete tokens back to continuous actions"""
    # To [0, 1]
    actions_normalized = actions_discrete.float() / (num_bins - 1)

    # To original range
    actions = actions_normalized * (action_bounds[1] - action_bounds[0]) + action_bounds[0]

    return actions
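
Since `.long()` truncates rather than rounds, a round trip through these two functions recovers each value to within one bin width, 2/255 ≈ 0.0078. A torch-free sketch of the same arithmetic makes that easy to check:

```python
def discretize(a, num_bins=256, lo=-1.0, hi=1.0):
    """Map a continuous value in [lo, hi] to a bin index (mirrors the torch version)."""
    idx = int((a - lo) / (hi - lo) * (num_bins - 1))
    return max(0, min(num_bins - 1, idx))

def undiscretize(idx, num_bins=256, lo=-1.0, hi=1.0):
    """Map a bin index back to the continuous range."""
    return idx / (num_bins - 1) * (hi - lo) + lo

# Round-trip error is bounded by one bin width, 2/255 ~ 0.0078
for a in [-1.0, -0.33, 0.0, 0.5, 1.0]:
    assert abs(undiscretize(discretize(a)) - a) <= 2 / 255
```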

3. FiLM-EfficientNet Conditioning

Language conditioning using FiLM layers:

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation"""
    def __init__(self, feature_dim, condition_dim):
        super().__init__()
        self.gamma = nn.Linear(condition_dim, feature_dim)
        self.beta = nn.Linear(condition_dim, feature_dim)

    def forward(self, features, conditioning):
        # features: (B, N, feature_dim); conditioning: (B, condition_dim)
        gamma = self.gamma(conditioning)
        beta = self.beta(conditioning)

        # Broadcast over the sequence (token) dimension and modulate
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

Training Details

Dataset: 130k demonstrations from 700+ tasks

Training Configuration:

config = {
    'batch_size': 256,
    'learning_rate': 1e-4,
    'weight_decay': 1e-4,
    'epochs': 100,
    'gradient_clip': 1.0,

    # Action space
    'action_bins': 256,
    'action_dim': 7,  # x, y, z, roll, pitch, yaw, gripper

    # Vision
    'image_size': (224, 224),
    'tokens_per_image': 8,

    # Language
    'max_instruction_length': 77,
    'language_embedding_dim': 512,
}
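
A quick sanity check on what these numbers imply (plain arithmetic, no assumptions beyond the values above): 256 bins over a [-1, 1] action range give a per-dimension resolution of 2/255 ≈ 0.0078, each control step emits 7 action tokens, and the flat action vocabulary has 7 × 256 = 1792 entries:

```python
config = {'action_bins': 256, 'action_dim': 7}

# Finest distinguishable change per action dimension over [-1, 1]
bin_width = 2.0 / (config['action_bins'] - 1)

# Tokens emitted per control step (one per dimension) and total action vocabulary
tokens_per_step = config['action_dim']
action_vocab_size = config['action_dim'] * config['action_bins']

print(round(bin_width, 4))   # ~0.0078
print(tokens_per_step)       # 7
print(action_vocab_size)     # 1792
```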

Results

  • Success rate: 97% on seen tasks, 76% on novel tasks
  • Real-time capable: 3 Hz control frequency
  • Generalization: Handles novel objects and scenarios

RT-2: Vision-Language-Action Model from Web Data

Paper: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (Brohan et al., 2023)

Revolutionary Approach

RT-2 co-fine-tunes a pre-trained vision-language model (VLM) for robotics:

graph TD
    A[Pre-trained VLM<br/>PaLI or PaLM-E] --> B[Add Action Tokens]
    B --> C[Co-fine-tune on<br/>Robot Data]
    C --> D[VLA Model]
    D --> E[Understands:<br/>- Vision<br/>- Language<br/>- Actions]

Key Insight

Instead of training from scratch, RT-2 leverages internet-scale vision-language models:

  • PaLI: 5B parameter vision-language model
  • PaLM-E: 562B parameter embodied model
  • Pre-trained on billions of web images with text

Architecture

RT-2 uses the VLM architecture directly with minimal modifications:

class RT2Model(nn.Module):
    """RT-2: VLM adapted for robotics (schematic sketch; assumes the base
    VLM exposes encode/generate methods)"""
    def __init__(self, base_vlm, action_vocab_size=256):
        super().__init__()

        # Use pre-trained VLM
        self.vlm = base_vlm  # PaLI or PaLM-E
        self.action_vocab_size = action_vocab_size

        # Extend vocabulary with action tokens (one block of bins per dimension)
        self.action_token_embedding = nn.Embedding(
            action_vocab_size * 7,  # 7 dimensions x 256 bins each
            self.vlm.config.hidden_size
        )

    def forward(self, image, instruction):
        # Encode image and instruction
        vlm_output = self.vlm.encode(image, instruction)

        # Generate action tokens autoregressively, one per action dimension
        actions = self.vlm.generate(
            vlm_output,
            max_new_tokens=7,  # 7-DoF action
            use_cache=True
        )

        return self.decode_actions(actions)

    def decode_actions(self, action_tokens):
        """Convert discrete action tokens back to continuous values in [-1, 1]"""
        actions = []
        for dim_tokens in action_tokens.chunk(7, dim=-1):
            # Recover the bin index, then map [0, bins-1] -> [-1, 1]
            bin_idx = (dim_tokens % self.action_vocab_size).float()
            dim_action = bin_idx / (self.action_vocab_size - 1) * 2 - 1
            actions.append(dim_action)

        return torch.cat(actions, dim=-1)

Training Strategy

Two-stage approach:

  1. Pre-training: Train VLM on web data (already done)
  2. Co-fine-tuning: Fine-tune on robot data while preserving VLM capabilities

def co_finetune_rt2(vlm_model, robot_dataset, config):
    """Co-fine-tune VLM for robotics"""

    # Freeze most layers initially
    for param in vlm_model.parameters():
        param.requires_grad = False

    # Unfreeze last N layers
    for layer in vlm_model.layers[-config.unfreeze_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

    # Unfreeze action token embeddings
    for param in vlm_model.action_token_embedding.parameters():
        param.requires_grad = True

    # Optimize only the unfrozen parameters
    optimizer = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, vlm_model.parameters()),
        lr=config.learning_rate
    )

    for batch in robot_dataset:
        # Robot control loss
        predicted_actions = vlm_model(batch['image'], batch['instruction'])
        robot_loss = F.mse_loss(predicted_actions, batch['action'])

        # Optional: Mix with VLM objectives
        if config.use_vlm_loss:
            vlm_loss = vlm_model.compute_vlm_loss(batch)
            total_loss = robot_loss + config.vlm_loss_weight * vlm_loss
        else:
            total_loss = robot_loss

        # Update
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

Emergent Capabilities

RT-2 exhibits remarkable capabilities inherited from web pre-training:

1. Reasoning from Visual Observations

Instruction: "Pick up the object used for cutting paper"
RT-2: [Identifies scissors among multiple objects]

2. Mathematical/Symbolic Reasoning

Instruction: "Move to the sum of 2+1"
RT-2: [Moves to position 3]

3. Multi-lingual Understanding

Instruction (Spanish): "Recoge el objeto rojo"
RT-2: [Picks up red object, despite training only on English]

4. Chain-of-Thought Planning

Instruction: "Clear the table"
RT-2: [Generates plan:
  1. Identify objects on table
  2. Pick closest object
  3. Move to bin
  4. Drop
  5. Repeat]

Performance Comparison

Metric              RT-1      RT-2-PaLI   RT-2-PaLM-E
Seen Tasks          97%       93%         90%
Novel Tasks         76%       89%         93%
Emergent Skills     Limited   Good        Excellent
Zero-shot Symbols   0%        67%         84%
Model Size          35M       5B          562B

Implementation Details

Tokenization Strategy:

class ActionTokenizer:
    """Tokenize actions for RT-2"""
    def __init__(self, action_dim=7, bins_per_dim=256):
        self.action_dim = action_dim
        self.bins_per_dim = bins_per_dim

        # Create vocabulary: [ACTION_0_BIN_0, ACTION_0_BIN_1, ..., ACTION_6_BIN_255]
        self.vocab_size = action_dim * bins_per_dim

        # Special tokens
        self.ACTION_START = self.vocab_size
        self.ACTION_END = self.vocab_size + 1

    def tokenize(self, actions):
        """Convert continuous actions to tokens"""
        # actions: (batch, action_dim) in [-1, 1]

        # Discretize each dimension
        actions_normalized = (actions + 1) / 2  # to [0, 1]
        actions_discrete = (actions_normalized * (self.bins_per_dim - 1)).long()

        # Convert to flat token IDs
        tokens = []
        for i in range(self.action_dim):
            token_id = i * self.bins_per_dim + actions_discrete[:, i]
            tokens.append(token_id)

        tokens = torch.stack(tokens, dim=1)  # (batch, action_dim)

        return tokens

    def detokenize(self, tokens):
        """Convert tokens back to continuous actions"""
        # tokens: (batch, action_dim)

        actions = []
        for i in range(self.action_dim):
            # Extract bin index for this dimension
            bin_idx = tokens[:, i] % self.bins_per_dim

            # Convert to continuous
            action_dim = bin_idx.float() / (self.bins_per_dim - 1)  # to [0, 1]
            action_dim = action_dim * 2 - 1  # to [-1, 1]

            actions.append(action_dim)

        return torch.stack(actions, dim=1)
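
The flat vocabulary layout above (token id = dimension × 256 + bin index) can be checked without torch; a minimal sketch mirroring tokenize/detokenize for a single action vector:

```python
BINS, DIM = 256, 7

def to_tokens(action):
    """action: list of DIM floats in [-1, 1] -> flat token ids."""
    tokens = []
    for i, a in enumerate(action):
        b = int((a + 1) / 2 * (BINS - 1))  # bin index in [0, 255]
        tokens.append(i * BINS + b)        # offset into this dimension's block
    return tokens

def from_tokens(tokens):
    """Flat token ids -> approximate continuous action."""
    return [(t % BINS) / (BINS - 1) * 2 - 1 for t in tokens]

action = [0.0, 1.0, -1.0, 0.5, -0.5, 0.25, -0.25]
recovered = from_tokens(to_tokens(action))
# Round-trip error stays within one bin width per dimension
assert all(abs(a - r) <= 2 / 255 for a, r in zip(action, recovered))
```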

Training Loop:

def train_rt2(model, dataloader, config):
    """Training loop for RT-2"""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

    for epoch in range(config.num_epochs):
        for batch in dataloader:
            # Get inputs
            images = batch['images'].cuda()
            instructions = batch['instructions']
            actions = batch['actions'].cuda()

            # Tokenize actions
            action_tokens = model.tokenizer.tokenize(actions)

            # Forward pass
            logits = model(images, instructions, target_tokens=action_tokens)

            # Compute loss (cross-entropy over action token vocabulary)
            loss = F.cross_entropy(
                logits.view(-1, model.vocab_size),
                action_tokens.view(-1)
            )

            # Backward
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

Deployment Considerations

Inference Speed:

  • RT-1: ~3 Hz (real-time capable)
  • RT-2-PaLI: ~1 Hz
  • RT-2-PaLM-E: ~0.2 Hz (too slow for real-time control)
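
Put differently (simple arithmetic on the figures above), the control period is the inverse of the inference frequency, which makes the latency gap concrete:

```python
# Control period (ms) implied by each model's inference frequency
periods_ms = {
    name: 1000.0 / hz
    for name, hz in [("RT-1", 3.0), ("RT-2-PaLI", 1.0), ("RT-2-PaLM-E", 0.2)]
}

for name, p in periods_ms.items():
    print(f"{name}: one action every ~{p:.0f} ms")
```

At ~5 seconds per action, RT-2-PaLM-E is far outside the budget for closed-loop manipulation, which motivates the optimizations below.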

Solutions for RT-2:

  1. Model Distillation:

    # Distill RT-2-PaLM-E → RT-2-PaLI → RT-1-sized model
    teacher_model = RT2_PaLM_E()
    student_model = RT1()

    for batch in dataset:
        # Teacher predictions, softened by temperature
        with torch.no_grad():
            teacher_probs = F.softmax(teacher_model(batch) / temperature, dim=-1)

        # Student predictions at the same temperature
        student_log_probs = F.log_softmax(student_model(batch) / temperature, dim=-1)

        # KL divergence loss (kl_div expects log-probabilities for the student)
        distill_loss = F.kl_div(
            student_log_probs,
            teacher_probs,
            reduction='batchmean'
        )

        optimizer.zero_grad()
        distill_loss.backward()
        optimizer.step()

  2. Quantization: INT8 or INT4 quantization for faster inference

  3. Pruning: Remove redundant weights while maintaining performance
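
The quantization idea (item 2) can be illustrated without torch. Below is a generic symmetric INT8 scheme, an illustrative sketch rather than RT-2's actual deployment recipe: weights map to 8-bit integers with a per-tensor scale, and dequantization recovers them with small, bounded error:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~ q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [qi * scale for qi in q]

weights = [0.8, -0.31, 0.05, -1.2, 0.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# Reconstruction error is bounded by half the quantization step
assert all(abs(w - r) <= scale / 2 + 1e-12 for w, r in zip(weights, recovered))
```

Storing `q` plus one float `scale` cuts memory 4x versus float32 and enables faster integer matmuls, at the cost of this bounded rounding error.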

Practical Implementation

Complete RT-2 Style Model

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RT2LikeModel(nn.Module):
    """RT-2 style model using pre-trained VLM"""
    def __init__(self, vlm_name='google/paligemma-3b-pt-224', action_dim=7, bins=256):
        super().__init__()

        # Load pre-trained VLM
        self.vlm = AutoModel.from_pretrained(vlm_name)
        self.tokenizer = AutoTokenizer.from_pretrained(vlm_name)

        # Action tokenizer
        self.action_tokenizer = ActionTokenizer(action_dim, bins)

        # Extend embeddings for action tokens
        self.action_embeddings = nn.Embedding(
            action_dim * bins,
            self.vlm.config.hidden_size
        )

        # Action prediction head
        self.action_head = nn.Linear(
            self.vlm.config.hidden_size,
            action_dim * bins
        )

    def forward(self, images, instructions, actions=None):
        # Tokenize instruction
        text_tokens = self.tokenizer(
            instructions,
            return_tensors='pt',
            padding=True
        ).to(images.device)

        # VLM encoding
        outputs = self.vlm(
            pixel_values=images,
            input_ids=text_tokens['input_ids'],
            attention_mask=text_tokens['attention_mask']
        )

        # Get last hidden state
        hidden_states = outputs.last_hidden_state

        # Predict actions
        action_logits = self.action_head(hidden_states[:, -1, :])

        # Reshape to (batch, action_dim, bins)
        action_logits = action_logits.view(-1, self.action_tokenizer.action_dim, self.action_tokenizer.bins_per_dim)

        if actions is not None:
            # Training: compute loss
            action_tokens = self.action_tokenizer.tokenize(actions)
            loss = F.cross_entropy(
                action_logits.view(-1, self.action_tokenizer.bins_per_dim),
                action_tokens.view(-1)
            )
            return loss
        else:
            # Inference: sample actions
            action_tokens = torch.argmax(action_logits, dim=-1)
            predicted_actions = self.action_tokenizer.detokenize(action_tokens)
            return predicted_actions

Key Takeaways

RT-1

  • ✓ Real-time capable (3 Hz)
  • ✓ Efficient architecture
  • ✓ Proven on real robots
  • ✗ Limited generalization to novel concepts

RT-2

  • ✓ Exceptional generalization (inherited from web data)
  • ✓ Emergent reasoning capabilities
  • ✓ Multi-lingual support
  • ✗ Slower inference (requires optimization)
  • ✗ Requires large-scale pre-trained VLMs

When to Use

Use RT-1 when:

  • Real-time control is critical
  • Compute resources are limited
  • The task distribution is well-defined

Use RT-2 when:

  • Generalization to novel tasks is important
  • Reasoning and planning capabilities are needed
  • Slower inference is tolerable, or you can invest in optimization

References

  1. Brohan et al., "RT-1: Robotics Transformer for Real-World Control at Scale", CoRL 2022
  2. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv 2023
  3. Open X-Embodiment Collaboration, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models", arXiv 2023

Next Steps