
RL Algorithms for Robotics

This page covers popular reinforcement learning algorithms commonly used in robotics.

Algorithm Comparison

| Algorithm | Type       | Action Space        | Sample Efficiency | Stability | Use Case           |
|-----------|------------|---------------------|-------------------|-----------|--------------------|
| PPO       | On-policy  | Continuous/Discrete | Medium            | High      | General purpose    |
| SAC       | Off-policy | Continuous          | High              | Medium    | Complex tasks      |
| TD3       | Off-policy | Continuous          | High              | Medium    | Continuous control |
| DQN       | Off-policy | Discrete            | Medium            | Medium    | Discrete actions   |
| DDPG      | Off-policy | Continuous          | High              | Low       | Continuous control |

Policy Gradient Methods

Proximal Policy Optimization (PPO)

PPO is currently the most popular RL algorithm for robotics due to its stability and ease of use.

Key Idea: Limit each policy update so the new policy stays close to the previous one.

Algorithm:

for epoch in epochs:
    # Collect trajectories with the current policy
    trajectories = collect_rollouts(policy, num_steps)

    # Compute advantages (GAE)
    advantages = compute_gae(trajectories, value_function)

    # Multiple epochs of minibatch updates
    for _ in range(K):
        for minibatch in get_minibatches(trajectories):
            # Probability ratio between new and old policy
            ratio = policy(actions|states) / old_policy(actions|states)

            # Clipped surrogate objective
            L_clip = min(
                ratio * advantages,
                clip(ratio, 1-ε, 1+ε) * advantages
            )

            # Value loss
            L_vf = (value_function(states) - returns)²

            # Total loss (c1, c2: value and entropy coefficients)
            loss = -L_clip + c1*L_vf - c2*entropy

            # Gradient update
            optimizer.step(loss)
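The compute_gae step above can be sketched as follows (a minimal NumPy version; the function name matches the pseudocode, but the exact signature is an assumption):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation, computed backwards over a rollout.
    # `values` carries one extra entry: the bootstrap value for the last state.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages

# Toy rollout: three steps of reward 1 with zero value estimates
adv = compute_gae(np.ones(3), np.zeros(4), np.zeros(3))
```

The resulting advantages feed into the clipped surrogate objective; in practice they are usually also normalized per minibatch.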

Hyperparameters:

ppo:
  learning_rate: 3e-4
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  vf_coef: 0.5
  ent_coef: 0.01

Implementation:

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1
)

model.learn(total_timesteps=1_000_000)

Pros:

  • Very stable
  • Easy to tune
  • Works well in most scenarios

Cons:

  • Sample inefficient (on-policy)
  • Requires many environment steps

Trust Region Policy Optimization (TRPO)

Predecessor to PPO with theoretical guarantees.

Key Idea: Constrain KL divergence between old and new policy.

\[ \max_\theta \; \mathbb{E} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A(s,a) \right] \quad \text{s.t. } \mathbb{E}[\mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta)] \leq \delta \]

Pros:

  • Strong theoretical guarantees
  • Monotonic improvement

Cons:

  • Complex implementation
  • Computationally expensive
  • PPO often works as well in practice

Q-Learning Methods

Soft Actor-Critic (SAC)

State-of-the-art off-policy algorithm for continuous control.

Key Idea: Maximum entropy RL - maximize expected reward while keeping the policy as random as possible.

Objective:

\[ J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t) \sim \rho_\pi} [r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))] \]

Algorithm:

# Q-function update, with a' ~ π(·|s')
Q_target = r + γ(min_{i=1,2} Q_{target,i}(s', a') - α log π(a'|s'))
L_Q = (Q(s,a) - Q_target)²

# Policy update
L_π = α log π(a|s) - Q(s,a)

# Temperature update (automatic tuning)
L_α = -α(log π(a|s) + H_target)
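The soft Bellman target from the Q-function update above can be checked numerically (a minimal sketch with scalar inputs; the function name and values are illustrative, not a library API):

```python
import numpy as np

def sac_q_target(r, q1_next, q2_next, log_pi_next,
                 alpha=0.2, gamma=0.99, done=False):
    # Clipped double-Q minimum, minus the entropy term α log π(a'|s')
    soft_value = np.minimum(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * (1.0 - float(done)) * soft_value

# min(5.0, 4.0) = 4.0; entropy bonus 0.2 * 1.0; target = 1 + 0.99 * 4.2
target = sac_q_target(r=1.0, q1_next=5.0, q2_next=4.0, log_pi_next=-1.0)
```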

Implementation:

from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1_000_000,
    learning_starts=10000,
    batch_size=256,
    tau=0.005,
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    ent_coef='auto',  # Automatic temperature tuning
    verbose=1
)

model.learn(total_timesteps=1_000_000)

Pros:

  • Sample efficient (off-policy)
  • Very stable
  • Automatic temperature tuning

Cons:

  • Requires more memory (replay buffer)
  • Slower wall-clock time per step

Twin Delayed DDPG (TD3)

Improved version of DDPG with several stabilization tricks.

Key Improvements over DDPG:

  1. Twin Q-networks: Use minimum of two Q-functions
  2. Delayed policy updates: Update policy less frequently
  3. Target policy smoothing: Add noise to target actions

Algorithm:

# Critic update
a' = clip(π_target(s') + ε, a_low, a_high)
y = r + γ min_{i=1,2} Q_{target,i}(s', a')
L_Q = (Q_1(s,a) - y)² + (Q_2(s,a) - y)²

# Delayed actor update (every d steps)
if step % d == 0:
    L_π = -Q_1(s, π(s))
    update_target_networks()
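Target policy smoothing (improvement 3) can be sketched as follows (a minimal NumPy version; the noise scales follow common defaults, and the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target_action(pi_target_action, noise_std=0.2, noise_clip=0.5,
                      a_low=-1.0, a_high=1.0):
    # Clipped Gaussian noise on the target action, then clip to action bounds
    noise = np.clip(rng.normal(0.0, noise_std, size=np.shape(pi_target_action)),
                    -noise_clip, noise_clip)
    return np.clip(pi_target_action + noise, a_low, a_high)

a_next = td3_target_action(np.array([0.9, -0.95]))
```

The noise keeps the critic from exploiting sharp peaks in the learned Q-function.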

Pros:

  • More stable than DDPG
  • Good performance on continuous control

Cons:

  • More hyperparameters to tune
  • SAC often outperforms it

Deep Q-Networks (DQN)

For discrete action spaces.

Key Ideas:

  • Experience replay
  • Target network
  • Huber loss

Variants:

Double DQN

Addresses overestimation bias by selecting actions with the online network and evaluating them with the target network:

a_max = argmax_a Q(s', a; θ)
y = r + γ Q(s', a_max; θ_target)
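Numerically, the decoupled selection/evaluation looks like this (a minimal sketch; function name and input values are illustrative):

```python
import numpy as np

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99, done=False):
    # Select the action with the online network ...
    a_max = int(np.argmax(q_online_next))
    # ... but evaluate it with the target network
    return r + gamma * (1.0 - float(done)) * q_target_next[a_max]

# Online net prefers action 1; the target net values it at 4.0
y = double_dqn_target(r=1.0,
                      q_online_next=np.array([2.0, 5.0, 3.0]),
                      q_target_next=np.array([1.5, 4.0, 6.0]))
```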

Dueling DQN

Separates the value and advantage streams:

Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
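The mean-subtracted aggregation can be verified directly (a tiny sketch; the inputs are illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    # Subtracting the mean advantage keeps V and A identifiable
    return value + (advantages - advantages.mean())

q = dueling_q(2.0, np.array([0.0, 3.0, 0.0]))
```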

Rainbow

Combines multiple improvements:

  • Double Q-learning
  • Prioritized replay
  • Dueling networks
  • Multi-step returns
  • Distributional RL
  • Noisy networks

Model-Based RL

Dreamer

Learn world model and use it for planning in latent space.

Components:

  1. World model: Predicts next latent states and rewards
  2. Actor: Policy in latent space
  3. Critic: Value function in latent space

Workflow:

# Learn world model from experience
world_model.train(replay_buffer)

# Imagine trajectories in latent space
imagined_trajectories = world_model.imagine(policy, horizon=15)

# Train policy on imagined data
policy.train(imagined_trajectories)

Pros:

  • Very sample efficient
  • Can plan ahead
  • Learns reusable world models

Cons:

  • Complex to implement
  • Model errors can compound

MBPO (Model-Based Policy Optimization)

Combines model-based and model-free RL.

Algorithm:

  1. Collect data with the current policy
  2. Train a dynamics model
  3. Generate synthetic rollouts with the model
  4. Train the policy on both real and synthetic data

Pros:

  • More sample efficient than model-free methods
  • More robust than pure model-based methods

Robotics-Specific Algorithms

Hindsight Experience Replay (HER)

For sparse reward tasks.

Key Idea: After failing to reach the goal, pretend you wanted to reach the state you actually ended up in.

# Original trajectory (failed)
s_0 a_0 s_1 a_1 s_2 ... s_T  (didn't reach g)

# Hindsight trajectory (success!)
s_0 a_0 s_1 a_1 s_2 ... s_T  (reached s_T!)
# Relabel goal: g' = s_T
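A minimal relabeling sketch (scalar goals and a sparse 0/-1 reward are assumptions for illustration; the transition format is hypothetical):

```python
def her_relabel(transitions, achieved_goals, new_goal, tol=0.05):
    # transitions: list of (state, action, reward, next_state, goal)
    # achieved_goals[i]: the goal actually achieved after transition i
    relabeled = []
    for (s, a, _, s2, _), ag in zip(transitions, achieved_goals):
        # Recompute the sparse reward against the substituted goal
        r = 0.0 if abs(ag - new_goal) < tol else -1.0
        relabeled.append((s, a, r, s2, new_goal))
    return relabeled

# A two-step trajectory that failed to reach g = 5.0; relabel with g' = 2.0
new = her_relabel(
    [(0.0, 0, -1.0, 1.0, 5.0), (1.0, 0, -1.0, 2.0, 5.0)],
    achieved_goals=[1.0, 2.0],
    new_goal=2.0,
)
```

The final transition now carries a success reward, so the failed episode still provides a learning signal.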

Implementation:

from stable_baselines3 import HerReplayBuffer, SAC

model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy='future'
    )
)

Residual RL

Learn correction on top of classical controller.

# Combine classical and learned control
action = classical_controller(state) + rl_policy(state)
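A minimal sketch of this combination, with clipping to actuator limits added (the controller and policy here are stand-in lambdas, not a real API):

```python
import numpy as np

def residual_action(state, classical_controller, rl_policy,
                    a_low=-1.0, a_high=1.0):
    # Learned correction added to the base command, clipped to actuator limits
    base = classical_controller(state)
    correction = rl_policy(state)
    return np.clip(base + correction, a_low, a_high)

# Example with a hypothetical P-controller and a fixed learned offset
a = residual_action(0.5,
                    classical_controller=lambda s: np.array([0.8 * s]),
                    rl_policy=lambda s: np.array([0.7]))
```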

Pros:

  • Safer (starts from a working controller)
  • Faster learning
  • Better interpretability

Asymmetric Actor-Critic

Use privileged information during training.

# Training: Critic has access to privileged info
Q(s_privileged, a)  # s_privileged includes ground truth state

# Deployment: Actor only uses sensors
π(s_sensors)

Use Case: Training in simulation with perfect state, deploying with noisy sensors.

Algorithm Selection Guide

Choose PPO if:

  • General purpose robotics task
  • You want stability and ease of use
  • Sample efficiency is not critical
  • You have parallel simulation

Choose SAC if:

  • Continuous control task
  • You need sample efficiency
  • You have a replay buffer
  • You want automatic exploration tuning

Choose TD3 if:

  • Continuous control
  • You need stability
  • SAC is not performing well

Choose Model-Based (MBPO, Dreamer) if:

  • Sample efficiency is critical
  • Real-world data is expensive
  • Task has learnable dynamics

Choose HER if:

  • Sparse rewards
  • Goal-conditioned task
  • Reaching/manipulation task

Hyperparameter Tuning

General Tips

  1. Learning rate: Start with 3e-4, adjust if unstable
  2. Discount factor (γ): 0.99 for long horizon, 0.9 for short
  3. Batch size: Larger is more stable but slower
  4. Network architecture: [256, 256] is a good default

PPO Specific

# Conservative (stable)
clip_range: 0.1
ent_coef: 0.01
vf_coef: 0.5

# Aggressive (faster learning)
clip_range: 0.3
ent_coef: 0.001
vf_coef: 1.0

SAC Specific

# Sample efficient
buffer_size: 1_000_000
batch_size: 256
learning_starts: 10000

# Memory constrained
buffer_size: 100_000
batch_size: 128
learning_starts: 1000

Next Steps