RL Games: Fast GPU-Accelerated RL¶
RL Games is a high-performance RL library optimized for GPU-accelerated training with massive parallelization, achieving state-of-the-art speed.
Overview¶
RL Games is developed by Denys Makoviichuk and is the standard training library in NVIDIA's Isaac Gym / Isaac Lab ecosystem. It targets:
- Extreme parallelization (10,000+ environments on a single GPU)
- Multiple algorithms (PPO, SAC, A2C)
- Isaac Gym/Lab integration
- Very high training throughput (often 2-10x faster than comparable libraries)
Key Features:
- ✓ Multi-GPU and multi-node training
- ✓ Mixed precision (FP16) support
- ✓ Recurrent policies (LSTM, GRU)
- ✓ Self-play for competitive tasks
- ✓ Asymmetric actor-critic
- ✓ TensorBoard and WandB logging
Official Repository: https://github.com/Denys88/rl_games
Installation¶
Basic Installation¶
```bash
# Install from PyPI
pip install rl-games

# Or from source (for latest features)
git clone https://github.com/Denys88/rl_games.git
cd rl_games
pip install -e .
```
With Isaac Gym¶
```bash
# Install Isaac Gym first
# Download from: https://developer.nvidia.com/isaac-gym
cd isaacgym/python
pip install -e .

# Then install RL Games
pip install rl-games

# Verify installation
python -c "import rl_games; print(rl_games.__version__)"
```
Dependencies¶
Core dependencies (pulled in automatically by pip) include PyTorch, gym, Ray, PyYAML, and tensorboardX; see `setup.py` in the repository for the authoritative list.
Quick Start¶
Training with Config File¶
RL Games uses YAML configuration files:
```yaml
# config.yaml
params:
  algo:
    name: a2c_continuous  # PPO variant

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False  # Shared backbone
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0  # Log std initialization
        fixed_sigma: True
    mlp:
      units: [256, 128, 64]
      activation: elu
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: False
  load_path: ''

  config:
    name: Ant
    env_name: isaac-gym
    multi_gpu: False
    ppo: True
    mixed_precision: True  # FP16 for speed
    normalize_input: True
    normalize_value: True
    reward_shaper:
      scale_value: 0.01
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95  # GAE lambda
    learning_rate: 3e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    score_to_win: 20000
    max_epochs: 5000
    save_best_after: 100
    save_frequency: 50
    print_stats: True
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 16
    minibatch_size: 32768
    mini_epochs: 5
    critic_coef: 2
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
```
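The `gamma` and `tau` entries above control Generalized Advantage Estimation (`tau` is the GAE lambda). A minimal standalone sketch of that computation, ignoring episode termination for brevity (this is not RL Games code):

```python
# Minimal GAE sketch: shows how `gamma` and `tau` (GAE lambda) combine.
def compute_gae(rewards, values, gamma=0.99, tau=0.95):
    """`values` has one extra bootstrap entry: len(values) == len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors
        last_gae = delta + gamma * tau * last_gae
        advantages[t] = last_gae
    return advantages

adv = compute_gae([1.0, 1.0], [0.0, 0.0, 0.0])
print(adv)  # [1.9405, 1.0]
```

Lower `tau` shortens the horizon of this weighted sum, trading variance for bias.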
Python Training Script¶
```python
from rl_games.torch_runner import Runner
import yaml

def train():
    # Load config
    with open('config.yaml', 'r') as f:
        config = yaml.safe_load(f)

    # Create runner
    runner = Runner()
    runner.load(config)

    # Register custom environment if needed
    # runner.register_env('my_env', create_env_fn)

    # Train
    runner.run({
        'train': True,
        'play': False,
        'checkpoint': '',  # Path to resume from
        'sigma': None      # Override exploration noise
    })

if __name__ == "__main__":
    train()
```
Running Training¶
```bash
# Train with config
python runner.py --train --file config.yaml

# Resume from checkpoint
python runner.py --train --file config.yaml --checkpoint runs/Ant/nn/Ant.pth

# Play (evaluate) trained policy
python runner.py --play --file config.yaml --checkpoint runs/Ant/nn/Ant.pth

# Multi-GPU training (set multi_gpu: True in the config, launch with torchrun)
torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file config.yaml
```
Core Components¶
1. PPO Implementation¶
RL Games' PPO is highly optimized for GPU:
```python
import torch
from rl_games.algos_torch import a2c_continuous

class A2CAgent(a2c_continuous.A2CAgent):
    """
    PPO implementation (called A2C but is actually PPO).

    Key optimizations:
    - Full GPU pipeline (no CPU<->GPU transfers)
    - Mixed precision training
    - Vectorized advantage computation
    - Multi-GPU support via data parallelism
    """

    def __init__(self, base_name, config):
        super().__init__(base_name, config)

        # Network
        self.model = self.network.build(config['network'])

        # Optimizer
        self.optimizer = torch.optim.Adam(
            self.model.parameters(),
            lr=config['learning_rate'],
            eps=1e-5
        )

        # Mixed precision
        if config['mixed_precision']:
            self.scaler = torch.cuda.amp.GradScaler()

        # Multi-GPU
        if config['multi_gpu']:
            self.model = torch.nn.DataParallel(self.model)

    def train_epoch(self):
        """Single PPO epoch"""
        # Collect experience
        batch_dict = self.play_steps()

        # Compute returns and advantages
        self.compute_returns(batch_dict)

        # Mini-batch updates
        for _ in range(self.mini_epochs):
            for batch in self.dataset.mini_batches(self.minibatch_size):
                # Forward pass
                with torch.cuda.amp.autocast(enabled=self.mixed_precision):
                    res_dict = self.model(batch)

                    # Compute losses
                    losses = self.calc_losses(batch, res_dict)
                    total_loss = losses['total_loss']

                # Backward pass
                self.optimizer.zero_grad()
                if self.mixed_precision:
                    self.scaler.scale(total_loss).backward()
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(
                        self.model.parameters(),
                        self.grad_norm
                    )
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    total_loss.backward()
                    torch.nn.utils.clip_grad_norm_(
                        self.model.parameters(),
                        self.grad_norm
                    )
                    self.optimizer.step()

        return losses

    def calc_losses(self, batch, res_dict):
        """Calculate PPO losses"""
        # Unpack rollout batch
        old_log_probs = batch['old_log_probs']
        advantages = batch['advantages']
        returns = batch['returns']
        old_values = batch['old_values']

        # Current policy
        log_probs = res_dict['log_probs']
        values = res_dict['values']
        entropy = res_dict['entropy']

        # PPO clipped loss
        ratio = torch.exp(log_probs - old_log_probs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - self.e_clip, 1.0 + self.e_clip) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        # Value loss (clipped)
        if self.clip_value:
            value_pred_clipped = old_values + torch.clamp(
                values - old_values, -self.e_clip, self.e_clip
            )
            value_losses = (values - returns).pow(2)
            value_losses_clipped = (value_pred_clipped - returns).pow(2)
            value_loss = torch.max(value_losses, value_losses_clipped).mean()
        else:
            value_loss = (returns - values).pow(2).mean()

        # Total loss
        total_loss = (
            policy_loss
            + self.critic_coef * value_loss
            - self.entropy_coef * entropy.mean()
        )

        return {
            'total_loss': total_loss,
            'policy_loss': policy_loss,
            'value_loss': value_loss,
            'entropy': entropy.mean()
        }
```
2. Network Architectures¶
RL Games supports various network architectures:
```python
import torch
import torch.nn as nn
from rl_games.algos_torch import network_builder

class ActorCriticBuilder(network_builder.NetworkBuilder):
    """Build actor-critic networks"""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, name, **kwargs):
        """Build network from config"""
        net = ActorCriticNetwork(
            input_shape=kwargs['input_shape'],
            actions_num=kwargs['actions_num'],
            mlp_units=kwargs['mlp']['units'],
            activation=kwargs['mlp']['activation'],
            separate=kwargs['separate']
        )
        return net

class ActorCriticNetwork(nn.Module):
    """
    Actor-Critic network with optional shared backbone.

    Supports:
    - Separate or shared actor/critic
    - LSTM/GRU for recurrent policies
    - Multi-head outputs (e.g., for multi-task)
    """

    def __init__(
        self,
        input_shape,
        actions_num,
        mlp_units=(256, 128, 64),
        activation='elu',
        separate=False,
        use_rnn=False,
        rnn_units=256
    ):
        super().__init__()
        self.separate = separate
        self.use_rnn = use_rnn

        # Activation
        act = get_activation(activation)

        if separate:
            # Separate actor and critic
            self.actor = build_mlp(input_shape, mlp_units, act)
            self.critic = build_mlp(input_shape, mlp_units, act)
        else:
            # Shared backbone
            self.backbone = build_mlp(input_shape, mlp_units, act)
        actor_input_size = mlp_units[-1]
        critic_input_size = mlp_units[-1]

        # RNN (optional)
        if use_rnn:
            self.rnn = nn.LSTM(
                input_size=mlp_units[-1],
                hidden_size=rnn_units,
                num_layers=1,
                batch_first=True
            )
            actor_input_size = rnn_units
            critic_input_size = rnn_units

        # Actor head (policy)
        self.mu = nn.Linear(actor_input_size, actions_num)
        self.log_std = nn.Parameter(torch.zeros(actions_num))

        # Critic head (value)
        self.value = nn.Linear(critic_input_size, 1)

        # Initialize
        self.apply(init_weights)

    def forward(self, obs, rnn_states=None):
        """Forward pass"""
        if self.separate:
            actor_features = self.actor(obs)
            critic_features = self.critic(obs)
        else:
            features = self.backbone(obs)
            actor_features = features
            critic_features = features

        # RNN
        if self.use_rnn:
            batch_size = obs.shape[0]
            seq_len = obs.shape[1]
            actor_features, rnn_states = self.rnn(
                actor_features.view(batch_size, seq_len, -1),
                rnn_states
            )
            critic_features = actor_features

        # Policy
        mu = self.mu(actor_features)
        std = torch.exp(self.log_std)

        # Sample actions
        dist = torch.distributions.Normal(mu, std)
        actions = dist.sample()
        log_probs = dist.log_prob(actions).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)

        # Value
        values = self.value(critic_features).squeeze(-1)

        return {
            'actions': actions,
            'log_probs': log_probs,
            'entropy': entropy,
            'values': values,
            'rnn_states': rnn_states
        }

def build_mlp(input_size, units, activation):
    """Build MLP"""
    layers = []
    prev_size = input_size
    for size in units:
        layers.append(nn.Linear(prev_size, size))
        layers.append(activation())
        prev_size = size
    return nn.Sequential(*layers)

def get_activation(name):
    """Get activation function"""
    if name == 'elu':
        return nn.ELU
    elif name == 'relu':
        return nn.ReLU
    elif name == 'tanh':
        return nn.Tanh
    else:
        raise ValueError(f"Unknown activation: {name}")

def init_weights(module):
    """Orthogonal initialization"""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=1.0)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)
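A self-contained sketch of the shared-backbone forward pass described above, with hypothetical sizes (8-dim observations, 2-dim actions) rather than the actual RL Games classes:

```python
# Minimal shared-backbone actor-critic forward pass (illustrative sizes).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(8, 32), nn.ELU())
mu_head = nn.Linear(32, 2)        # action mean
value_head = nn.Linear(32, 1)     # state value
log_std = nn.Parameter(torch.zeros(2))  # state-independent log std

obs = torch.randn(4, 8)           # batch of 4 observations
features = backbone(obs)          # one backbone feeds both heads
dist = torch.distributions.Normal(mu_head(features), log_std.exp())
actions = dist.sample()
log_probs = dist.log_prob(actions).sum(dim=-1)  # joint log-prob per sample
values = value_head(features).squeeze(-1)

print(actions.shape, log_probs.shape, values.shape)
# torch.Size([4, 2]) torch.Size([4]) torch.Size([4])
```

The shared backbone halves feature computation versus `separate: True`, at the cost of coupling policy and value gradients.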
3. Custom Environments¶
Register custom environments:
```python
from rl_games.common import env_configurations

def create_my_env(**kwargs):
    """Create custom environment"""
    import gym

    # Your custom environment
    env = gym.make('MyEnv-v0')
    return env

# Register environment; the vecenv_type handles vectorization
env_configurations.register(
    'my_env',
    {
        'vecenv_type': 'RAY',  # or 'SIMPLE'
        'env_creator': lambda **kwargs: create_my_env(**kwargs)
    }
)

# Use in config.yaml:
# config:
#   env_name: my_env
```
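The environment returned by `env_creator` just needs the Gym-style `reset`/`step` interface. A hypothetical minimal environment in plain Python (sketch only, not using the actual `gym` package):

```python
# Hypothetical environment exposing the Gym-style reset/step protocol.
class CountdownEnv:
    """Reward 1.0 per step; episode ends after 10 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # initial observation

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0
        done = self.t >= 10
        return obs, reward, done, {}  # obs, reward, done, info

env = CountdownEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, done, info = env.step(0)
    total += reward
print(total)  # 10.0
```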
Advanced Features¶
Mixed Precision Training¶
Mixed precision runs most of the network in FP16, typically yielding a 1.5-2x speedup and roughly halving memory usage. Enable it with `mixed_precision: True` in the config.
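A standalone sketch of the autocast + gradient-scaler pattern this setting enables (it runs in plain FP32 when no GPU is available; the model and data here are illustrative):

```python
# One mixed-precision training step with PyTorch AMP.
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = (device == 'cuda')  # autocast to FP16 only makes sense on GPU

model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
target = torch.zeros(32, 1, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), target)

optimizer.zero_grad()
scaler.scale(loss).backward()  # scaled backward avoids FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps
scaler.update()                # adapts the loss scale over time
print(torch.isfinite(loss).item())  # True
```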
Multi-GPU Training¶
Enable with `multi_gpu: True` in the config. Scaling is nearly linear with GPU count (roughly 1.8-1.9x with 2 GPUs).
Recurrent Policies (LSTM/GRU)¶
For partially observable tasks:
```yaml
network:
  name: actor_critic
  rnn:
    name: lstm
    units: 256
    layers: 1
    before_mlp: False  # RNN after MLP
  mlp:
    units: [256, 128]
    activation: elu
```
When to use:
- Partial observability (e.g., limited sensor range)
- Tasks requiring memory (e.g., navigation)
- Time-series tasks
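The point of the recurrent core is that output depends on carried hidden state, not just the current observation. A standalone sketch (illustrative sizes, not RL Games code):

```python
# Same input, different hidden state -> different LSTM output.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

x = torch.ones(1, 1, 4)            # one timestep, identical observation
out1, state = rnn(x)               # fresh (zero) hidden state
out2, state = rnn(x, state)        # carried hidden state from step 1

print(torch.allclose(out1, out2))  # False: memory changes the output
```

This is why `seq_len` matters in the config: BPTT unrolls the policy over short windows of stored hidden states.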
Self-Play¶
For competitive multi-agent tasks:
```yaml
config:
  self_play_config:
    use_selfplay: True
    save_every_steps: 10000
    swap_steps: 8000     # Swap opponent every N steps
    games_to_check: 200  # Games to determine winner
    win_rate: 0.55       # Required win rate to update opponent
```
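A hypothetical sketch of the opponent-update rule these parameters describe (simplified logic, not RL Games' actual implementation): track recent game outcomes and promote the current policy to opponent once its win rate clears the threshold.

```python
# Simplified self-play opponent-update decision.
def should_update_opponent(results, games_to_check=200, win_rate=0.55):
    """results: list of 1 (win) / 0 (loss) outcomes, newest last."""
    if len(results) < games_to_check:
        return False  # not enough evidence yet
    recent = results[-games_to_check:]
    return sum(recent) / games_to_check > win_rate

wins = [1] * 120 + [0] * 80          # 60% win rate over 200 games
print(should_update_opponent(wins))  # True
print(should_update_opponent([1] * 50))  # False: too few games
```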
Asymmetric Actor-Critic¶
Different observations for actor (deployed) vs critic (training):
```yaml
network:
  name: actor_critic_asymmetric
  actor:
    input_shape: [48]   # Deployed observations
    mlp:
      units: [256, 128, 64]
  critic:
    input_shape: [187]  # Privileged observations
    mlp:
      units: [512, 256, 128]
```
Use case: Sim-to-real (critic sees perfect sim state, actor sees real observations)
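A standalone sketch of the asymmetric split: the actor consumes only deployable observations while the critic additionally sees privileged simulator state (sizes mirror the config above but are illustrative, not the RL Games network):

```python
# Asymmetric actor-critic: different observation spaces per network.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(48, 64), nn.ELU(), nn.Linear(64, 12))
critic = nn.Sequential(nn.Linear(187, 128), nn.ELU(), nn.Linear(128, 1))

actor_obs = torch.randn(4, 48)    # what the robot can measure at deployment
critic_obs = torch.randn(4, 187)  # full privileged sim state (training only)

actions = actor(actor_obs)
values = critic(critic_obs).squeeze(-1)
print(actions.shape, values.shape)  # torch.Size([4, 12]) torch.Size([4])
```

Only the actor is exported for deployment; the critic, and its privileged inputs, are discarded after training.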
Isaac Gym Integration¶
Complete Example¶
"""
train_humanoid.py
Train humanoid with RL Games + Isaac Gym
"""
from rl_games.torch_runner import Runner
from isaacgym import gymapi
import yaml
# Isaac Gym configuration
gym_config = {
"name": "Humanoid",
"physics_engine": gymapi.SIM_PHYSX,
"sim": {
"dt": 0.0166,
"substeps": 2,
"up_axis": "z",
"use_gpu_pipeline": True,
"physx": {
"num_threads": 4,
"solver_type": 1,
"num_position_iterations": 4,
"num_velocity_iterations": 0,
"contact_offset": 0.002,
"rest_offset": 0.0,
"bounce_threshold_velocity": 0.2,
"max_depenetration_velocity": 10.0,
"default_buffer_size_multiplier": 2.0
}
},
"task": {
"randomize": True,
"randomization_params": {
"frequency": 600,
"observations": {
"range": [0, 0.002],
"operation": "additive"
},
"actions": {
"range": [0.0, 0.02],
"operation": "additive"
},
"sim_params": {
"gravity": {
"range": [0, 0.4],
"operation": "additive"
}
},
"actor_params": {
"humanoid": {
"color": True,
"dof_properties": {
"damping": {
"range": [0.5, 1.5],
"operation": "scaling"
},
"stiffness": {
"range": [0.5, 1.5],
"operation": "scaling"
}
},
"rigid_body_properties": {
"mass": {
"range": [0.5, 1.5],
"operation": "scaling"
}
}
}
}
}
},
"env": {
"numEnvs": 4096,
"envSpacing": 5,
"enableDebugVis": False
}
}
# RL Games configuration
rl_config = {
"params": {
"algo": {
"name": "a2c_continuous"
},
"model": {
"name": "continuous_a2c_logstd"
},
"network": {
"name": "actor_critic",
"separate": False,
"space": {
"continuous": {
"mu_activation": "None",
"sigma_activation": "None",
"mu_init": {"name": "default"},
"sigma_init": {"name": "const_initializer", "val": 0},
"fixed_sigma": True
}
},
"mlp": {
"units": [400, 200, 100],
"activation": "elu",
"initializer": {"name": "default"}
}
},
"config": {
"name": "Humanoid",
"env_name": "isaac",
"multi_gpu": False,
"ppo": True,
"mixed_precision": True,
"normalize_input": True,
"normalize_value": True,
"reward_shaper": {"scale_value": 0.01},
"normalize_advantage": True,
"gamma": 0.99,
"tau": 0.95,
"learning_rate": 2e-4,
"lr_schedule": "adaptive",
"kl_threshold": 0.008,
"score_to_win": 10000,
"max_epochs": 10000,
"save_best_after": 100,
"save_frequency": 200,
"grad_norm": 1.0,
"entropy_coef": 0.0,
"e_clip": 0.2,
"horizon_length": 32,
"minibatch_size": 32768,
"mini_epochs": 5,
"critic_coef": 4,
"clip_value": True
}
}
}
def train():
# Create runner
runner = Runner()
runner.load(rl_config)
# Train
runner.run({
'train': True,
'play': False,
'checkpoint': ''
})
if __name__ == "__main__":
train()
Running on Isaac Gym¶
```bash
# Train
python train_humanoid.py

# Pin to a specific GPU
CUDA_VISIBLE_DEVICES=0 python train_humanoid.py
```
The minimal script above hard-codes its run arguments; to support flags such as `--play --checkpoint runs/Humanoid/nn/Humanoid.pth`, or multi-GPU launches, add argument parsing (as in RL Games' own `runner.py`) and feed the parsed values into `runner.run`.
Performance Optimization¶
Benchmarks¶
Training speed on Humanoid task (4096 environments, RTX 3090):
| Configuration | FPS | Time to 10M steps |
|---|---|---|
| RL Games (FP16, GPU pipeline) | 145,000 | 68 seconds |
| RL Games (FP32, GPU pipeline) | 89,000 | 112 seconds |
| RSL-RL | 72,000 | 139 seconds |
| Stable-Baselines3 (CPU) | 3,500 | 47 minutes |
Speedup: roughly 40x faster than CPU-based Stable-Baselines3 in this benchmark.
Optimization Tips¶
```yaml
# Maximum speed configuration
config:
  # GPU pipeline (critical!)
  use_gpu_pipeline: True

  # Mixed precision (~2x speedup)
  mixed_precision: True

  # Large batches for GPU efficiency
  minibatch_size: 32768
  horizon_length: 32

  # Normalize for stability
  normalize_input: True
  normalize_value: True
  normalize_advantage: True

  # Adaptive LR for speed
  lr_schedule: adaptive

  # Reduce logging overhead
  print_stats: False  # Only when debugging

  # Efficient value loss
  clip_value: True
```
Tips & Best Practices¶
Hyperparameter Tuning¶
Start with defaults:
```yaml
config:
  gamma: 0.99
  tau: 0.95
  learning_rate: 3e-4
  e_clip: 0.2
  mini_epochs: 5
  horizon_length: 16
  minibatch_size: 32768
```
For hard exploration:
```yaml
config:
  entropy_coef: 0.01   # Increase exploration
  learning_rate: 1e-4  # Slower learning
  tau: 0.9             # Lower GAE lambda
```
For sample efficiency:
```yaml
config:
  horizon_length: 64     # Longer rollouts
  mini_epochs: 10        # More SGD steps
  minibatch_size: 16384  # Smaller batches
```
Debugging¶
```yaml
# Enable verbose logging
config:
  print_stats: True

# Check gradients
config:
  grad_norm: 0.5  # Lower if gradients are exploding

# Visualize training
config:
  enable_tensorboard: True
```
```python
# Profile performance
import cProfile
cProfile.run('runner.run(args)', 'stats.prof')

# Analyze profile
import pstats
p = pstats.Stats('stats.prof')
p.sort_stats('cumulative').print_stats(20)
```
Common Issues¶
Problem: NaN losses — lower the learning rate, keep gradient clipping on (`truncate_grads: True`, `grad_norm`), and scale rewards down via `reward_shaper.scale_value`; check the environment for invalid observations.
Problem: Slow convergence — use the adaptive LR schedule with a KL threshold, increase `horizon_length`, and verify `normalize_input`/`normalize_value` are enabled.
Problem: OOM (Out of Memory) — reduce `minibatch_size`, `horizon_length`, or the number of environments; enabling `mixed_precision` also roughly halves activation memory.
References¶
Official Resources¶
- RL Games GitHub: https://github.com/Denys88/rl_games
- Documentation: https://github.com/Denys88/rl_games/tree/master/docs
- Isaac Gym: https://developer.nvidia.com/isaac-gym
Papers¶
- Isaac Gym: Makoviychuk et al., "Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning", NeurIPS 2021 (Datasets and Benchmarks)
- GPU RL: Makoviychuk & Makoviichuk, "RL Games: High Performance RL Library", 2021
Benchmarks¶
- Isaac Gym Benchmark: https://github.com/NVIDIA-Omniverse/IsaacGymEnvs
- Comprehensive benchmarks for all algorithms
- Pre-configured tasks
Next Steps¶
- RSL-RL - Alternative Isaac Lab RL library
- Stable-Baselines3 - More general-purpose RL
- Isaac Gym Envs - Benchmark environments