
RL Algorithms for Robotics

This page covers popular reinforcement learning algorithms commonly used in robotics.

Algorithm Comparison

| Algorithm | Type       | Action Space        | Sample Efficiency | Stability | Use Case           |
|-----------|------------|---------------------|-------------------|-----------|--------------------|
| PPO       | On-policy  | Continuous/Discrete | Medium            | High      | General purpose    |
| SAC       | Off-policy | Continuous          | High              | Medium    | Complex tasks      |
| TD3       | Off-policy | Continuous          | High              | Medium    | Continuous control |
| DQN       | Off-policy | Discrete            | Medium            | Medium    | Discrete actions   |
| DDPG      | Off-policy | Continuous          | High              | Low       | Continuous control |

Policy Gradient Methods

Proximal Policy Optimization (PPO)

PPO is currently the most popular RL algorithm for robotics due to its stability and ease of use.

Key Idea: Limit each policy update so the new policy stays close to the previous one.

Algorithm:

for epoch in epochs:
    # Collect trajectories with the current policy
    trajectories = collect_rollouts(policy, num_steps)

    # Compute advantages (GAE)
    advantages = compute_gae(trajectories, value_function)

    # Multiple epochs of minibatch updates
    for _ in range(K):
        for minibatch in get_minibatches(trajectories):
            # Probability ratio between new and old policy
            ratio = policy(actions|states) / old_policy(actions|states)

            # Clipped surrogate objective
            L_clip = min(
                ratio * advantages,
                clip(ratio, 1-ε, 1+ε) * advantages
            )

            # Value loss
            L_vf = (value_function(states) - returns)²

            # Total loss (c1, c2: value and entropy coefficients)
            loss = -L_clip + c1*L_vf - c2*entropy

            # Gradient update
            optimizer.step(loss)
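The compute_gae step above can be sketched as follows (a minimal NumPy version; the function name matches the pseudocode, but the exact signature is an assumption):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation, computed backwards over a rollout.
    # `values` carries one extra entry: the bootstrap value for the last state.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

# Toy rollout: three steps of reward 1 with zero value estimates
adv = compute_gae(np.ones(3), np.zeros(4), np.zeros(3))
```

The resulting advantages feed into the clipped surrogate objective; in practice they are usually also normalized per minibatch.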

Hyperparameters:

ppo:
  learning_rate: 3e-4
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  vf_coef: 0.5
  ent_coef: 0.01

Implementation:

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)

model.learn(total_timesteps=1_000_000)

Pros:

  • Very stable
  • Easy to tune
  • Works well in most scenarios

Cons:

  • Sample inefficient (on-policy)
  • Requires many environment steps

Trust Region Policy Optimization (TRPO)

Predecessor to PPO with theoretical guarantees.

Key Idea: Constrain KL divergence between old and new policy.

\[ \max_\theta \; \mathbb{E} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A(s,a) \right] \quad \text{s.t. } \mathbb{E}[\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)] \leq \delta \]

Pros:

  • Strong theoretical guarantees
  • Monotonic improvement

Cons:

  • Complex implementation
  • Computationally expensive
  • PPO often works as well in practice

Q-Learning Methods

Soft Actor-Critic (SAC)

State-of-the-art off-policy algorithm for continuous control.

Key Idea: Maximum entropy RL - maximize expected reward while keeping the policy as random as possible.

Objective:

\[ J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t) \sim \rho_\pi} [r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))] \]

Algorithm:

# Q-function update, with a' ~ π(·|s')
Q_target = r + γ(min_{i=1,2} Q_{target,i}(s', a') - α log π(a'|s'))
L_Q = (Q(s,a) - Q_target)²

# Policy update
L_π = α log π(a|s) - Q(s,a)

# Temperature update (automatic tuning)
L_α = -α(log π(a|s) + H_target)
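The soft Bellman target from the Q-function update above can be checked numerically (a minimal sketch with scalar inputs; the function name and values are illustrative, not a library API):

```python
import numpy as np

def sac_q_target(r, q1_next, q2_next, log_pi_next,
                 alpha=0.2, gamma=0.99, done=False):
    # Clipped double-Q minimum, minus the entropy term α log π(a'|s')
    soft_value = np.minimum(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * (1.0 - float(done)) * soft_value

# min(5.0, 4.0) = 4.0; entropy bonus 0.2 * 1.0; target = 1 + 0.99 * 4.2
target = sac_q_target(r=1.0, q1_next=5.0, q2_next=4.0, log_pi_next=-1.0)
```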

Implementation:

from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    learning_starts=10000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    ent_coef='auto',  # Automatic temperature tuning
    verbose=1
)

model.learn(total_timesteps=1_000_000)

Pros:

  • Sample efficient (off-policy)
  • Very stable
  • Automatic temperature tuning

Cons:

  • Requires more memory (replay buffer)
  • Slower wall-clock time per step

Twin Delayed DDPG (TD3)

Improved version of DDPG with several stabilization tricks.

Key Improvements over DDPG:

  1. Twin Q-networks: Use minimum of two Q-functions
  2. Delayed policy updates: Update policy less frequently
  3. Target policy smoothing: Add noise to target actions

Algorithm:

# Critic update
a' = clip(π_target(s') + ε, a_low, a_high)
y = r + γ min_{i=1,2} Q_{target,i}(s', a')
L_Q = (Q_1(s,a) - y)² + (Q_2(s,a) - y)²

# Delayed actor update (every d steps)
if step % d == 0:
    L_π = -Q_1(s, π(s))
    update_target_networks()
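Target policy smoothing (improvement 3) can be sketched as follows (a minimal NumPy version; the noise scales follow common defaults, and the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target_action(pi_target_action, noise_std=0.2, noise_clip=0.5,
                      a_low=-1.0, a_high=1.0):
    # Clipped Gaussian noise on the target action, then clip to action bounds
    noise = np.clip(rng.normal(0.0, noise_std, size=np.shape(pi_target_action)),
                    -noise_clip, noise_clip)
    return np.clip(pi_target_action + noise, a_low, a_high)

a_next = td3_target_action(np.array([0.9, -0.95]))
```

The noise keeps the critic from exploiting sharp peaks in the learned Q-function.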

Pros:

  • More stable than DDPG
  • Good performance on continuous control

Cons:

  • More hyperparameters to tune
  • SAC often outperforms it

Deep Q-Networks (DQN)

For discrete action spaces.

Key Ideas:

  • Experience replay
  • Target network
  • Huber loss

Variants:

Double DQN

Addresses overestimation bias by selecting actions with the online network and evaluating them with the target network:

a_max = argmax_a Q(s', a; θ)
y = r + γ Q(s', a_max; θ_target)
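Numerically, the decoupled selection/evaluation looks like this (a minimal sketch; function name and input values are illustrative):

```python
import numpy as np

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    # Select the action with the online network ...
    a_max = int(np.argmax(q_online_next))
    # ... but evaluate it with the target network
    return r + gamma * (1.0 - float(done)) * q_target_next[a_max]

# Online net prefers action 1; the target net values it at 4.0
y = double_dqn_target(r=1.0,
                      q_online_next=np.array([2.0, 5.0, 3.0]),
                      q_target_next=np.array([1.5, 4.0, 6.0]))
```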

Dueling DQN

Separates the value and advantage streams:

Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
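The mean-subtracted aggregation can be verified directly (a tiny sketch; the inputs are illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    # Subtracting the mean advantage keeps V and A identifiable
    return value + (advantages - advantages.mean())

q = dueling_q(2.0, np.array([0.0, 3.0, 0.0]))
```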

Rainbow

Combines multiple improvements:

  • Double Q-learning
  • Prioritized replay
  • Dueling networks
  • Multi-step returns
  • Distributional RL
  • Noisy networks

Model-Based RL

Dreamer

Learn world model and use it for planning in latent space.

Components:

  1. World model: Predicts next latent states and rewards
  2. Actor: Policy in latent space
  3. Critic: Value function in latent space

Workflow:

# Learn world model from experience
world_model.train(replay_buffer)

# Imagine trajectories in latent space
imagined_trajectories = world_model.imagine(policy, horizon=15)

# Train policy on imagined data
policy.train(imagined_trajectories)

Pros:

  • Very sample efficient
  • Can plan ahead
  • Learns reusable world models

Cons:

  • Complex to implement
  • Model errors can compound

MBPO (Model-Based Policy Optimization)

Combines model-based and model-free RL.

Algorithm:

  1. Collect data with the current policy
  2. Train a dynamics model
  3. Generate synthetic rollouts with the model
  4. Train the policy on both real and synthetic data

Pros:

  • More sample efficient than model-free methods
  • More robust than pure model-based methods

Robotics-Specific Algorithms

Hindsight Experience Replay (HER)

For sparse reward tasks.

Key Idea: After failing to reach the goal, pretend you wanted to reach the state you actually ended up in.

# Original trajectory (failed)
s_0 a_0 s_1 a_1 s_2 ... s_T  (didn't reach g)

# Hindsight trajectory (success!)
s_0 a_0 s_1 a_1 s_2 ... s_T  (reached s_T!)
# Relabel goal: g' = s_T
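A minimal relabeling sketch (scalar goals and a sparse 0/-1 reward are assumptions for illustration; the transition format is hypothetical):

```python
def her_relabel(transitions, achieved_goals, new_goal, tol=0.05):
    # transitions: list of (state, action, reward, next_state, goal)
    # achieved_goals[i]: the goal actually achieved after transition i
    relabeled = []
    for (s, a, _, s2, _), ag in zip(transitions, achieved_goals):
        # Recompute the sparse reward against the substituted goal
        r = 0.0 if abs(ag - new_goal) < tol else -1.0
        relabeled.append((s, a, r, s2, new_goal))
    return relabeled

# A two-step trajectory that failed to reach g = 5.0; relabel with g' = 2.0
new = her_relabel(
    [(0.0, 0, -1.0, 1.0, 5.0), (1.0, 0, -1.0, 2.0, 5.0)],
    achieved_goals=[1.0, 2.0],
    new_goal=2.0,
)
```

The final transition now carries a success reward, so the failed episode still provides a learning signal.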

Implementation:

from stable_baselines3 import HerReplayBuffer, SAC

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy='future'
    )
)

Residual RL

Learn correction on top of classical controller.

# Combine classical and learned control
action = classical_controller(state) + rl_policy(state)
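A minimal sketch of this combination, with clipping to actuator limits added (the controller and policy here are stand-in lambdas, not a real API):

```python
import numpy as np

def residual_action(state, classical_controller, rl_policy,
                    a_low=-1.0, a_high=1.0):
    # Learned correction added to the base command, clipped to actuator limits
    base = classical_controller(state)
    correction = rl_policy(state)
    return np.clip(base + correction, a_low, a_high)

# Example with a hypothetical P-controller and a fixed learned offset
a = residual_action(0.5,
                    classical_controller=lambda s: np.array([0.8 * s]),
                    rl_policy=lambda s: np.array([0.7]))
```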

Pros:

  • Safer (starts from a working controller)
  • Faster learning
  • Better interpretability

Asymmetric Actor-Critic

Use privileged information during training.

# Training: Critic has access to privileged info
Q(s_privileged, a)  # s_privileged includes ground truth state

# Deployment: Actor only uses sensors
π(s_sensors)

Use Case: Training in simulation with perfect state, deploying with noisy sensors.

Algorithm Selection Guide

Choose PPO if:

  • General purpose robotics task
  • You want stability and ease of use
  • Sample efficiency is not critical
  • You have parallel simulation

Choose SAC if:

  • Continuous control task
  • You need sample efficiency
  • You have a replay buffer
  • You want automatic exploration tuning

Choose TD3 if:

  • Continuous control
  • You need stability
  • SAC is not performing well

Choose Model-Based (MBPO, Dreamer) if:

  • Sample efficiency is critical
  • Real-world data is expensive
  • Task has learnable dynamics

Choose HER if:

  • Sparse rewards
  • Goal-conditioned task
  • Reaching/manipulation task

Hyperparameter Tuning

General Tips

  1. Learning rate: Start with 3e-4, adjust if unstable
  2. Discount factor (γ): 0.99 for long horizon, 0.9 for short
  3. Batch size: Larger is more stable but slower
  4. Network architecture: [256, 256] is a good default

PPO Specific

# Conservative (stable)
clip_range: 0.1
ent_coef: 0.01
vf_coef: 0.5

# Aggressive (faster learning)
clip_range: 0.3
ent_coef: 0.001
vf_coef: 1.0

SAC Specific

# Sample efficient
buffer_size: 1_000_000
batch_size: 256
learning_starts: 10000

# Memory constrained
buffer_size: 100_000
batch_size: 128
learning_starts: 1000

Next Steps