RL Games: Fast GPU-Accelerated RL¶
RL Games is a high-performance RL library optimized for GPU-accelerated training with massive parallelization, achieving state-of-the-art speed.
Overview¶
RL Games is developed by Denys Makoviichuk and is the standard training library in NVIDIA's Isaac Gym / Isaac Lab ecosystem. It targets:
- Extreme parallelization (10,000+ environments on a single GPU)
- Multiple algorithms (PPO, SAC, A2C)
- Isaac Gym/Lab integration
- Very high training throughput (often 2-10x faster than comparable libraries)
Key Features:
- ✓ Multi-GPU and multi-node training
- ✓ Mixed precision (FP16) support
- ✓ Recurrent policies (LSTM, GRU)
- ✓ Self-play for competitive tasks
- ✓ Asymmetric actor-critic
- ✓ TensorBoard and WandB logging
Official Repository: https://github.com/Denys88/rl_games
Installation¶
Basic Installation¶
```bash
# Install from PyPI
pip install rl-games

# Or from source (for latest features)
git clone https://github.com/Denys88/rl_games.git
cd rl_games
pip install -e .
```
With Isaac Gym¶
```bash
# Install Isaac Gym first
# Download from: https://developer.nvidia.com/isaac-gym
cd isaacgym/python
pip install -e .

# Then install RL Games
pip install rl-games

# Verify installation
python -c "import rl_games; print(rl_games.__version__)"
```
Dependencies¶
Core dependencies (pulled in automatically by pip) include PyTorch, gym, Ray, PyYAML, and tensorboardX; see `setup.py` in the repository for the authoritative list.
Quick Start¶
Training with Config File¶
RL Games uses YAML configuration files:
```yaml
# config.yaml
params:
  algo:
    name: a2c_continuous  # PPO variant

  model:
    name: continuous_a2c_logstd

  network:
    name: actor_critic
    separate: False  # Shared backbone
    space:
      continuous:
        mu_activation: None
        sigma_activation: None
        mu_init:
          name: default
        sigma_init:
          name: const_initializer
          val: 0  # Log std initialization
        fixed_sigma: True
    mlp:
      units: [256, 128, 64]
      activation: elu
      initializer:
        name: default
      regularizer:
        name: None

  load_checkpoint: False
  load_path: ''

  config:
    name: Ant
    env_name: isaac-gym
    multi_gpu: False
    ppo: True
    mixed_precision: True  # FP16 for speed
    normalize_input: True
    normalize_value: True
    reward_shaper:
      scale_value: 0.01
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95  # GAE lambda
    learning_rate: 3e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    score_to_win: 20000
    max_epochs: 5000
    save_best_after: 100
    save_frequency: 50
    print_stats: True
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 16
    minibatch_size: 32768
    mini_epochs: 5
    critic_coef: 2
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
```
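The `gamma` and `tau` entries above control Generalized Advantage Estimation (`tau` is the GAE lambda). A minimal standalone sketch of that computation, ignoring episode termination for brevity (this is not RL Games code):

```python
# Minimal GAE sketch: shows how `gamma` and `tau` (GAE lambda) combine.
def compute_gae(rewards, values, gamma=0.99, tau=0.95):
    """`values` has one extra bootstrap entry: len(values) == len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors
        last_gae = delta + gamma * tau * last_gae
        advantages[t] = last_gae
    return advantages

adv = compute_gae([1.0, 1.0], [0.0, 0.0, 0.0])
print(adv)  # [1.9405, 1.0]
```

Lower `tau` shortens the horizon of this weighted sum, trading variance for bias.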
Python Training Script¶
```python
from rl_games.torch_runner import Runner
import yaml

def train():
    # Load config
    with open('config.yaml', 'r') as f:
        config = yaml.safe_load(f)

    # Create runner
    runner = Runner()
    runner.load(config)

    # Register custom environment if needed
    # runner.register_env('my_env', create_env_fn)

    # Train
    runner.run({
        'train': True,
        'play': False,
        'checkpoint': '',  # Path to resume from
        'sigma': None      # Override exploration noise
    })

if __name__ == "__main__":
    train()
```
Running Training¶
```bash
# Train with config
python runner.py --train --file config.yaml

# Resume from checkpoint
python runner.py --train --file config.yaml --checkpoint runs/Ant/nn/Ant.pth

# Play (evaluate) trained policy
python runner.py --play --file config.yaml --checkpoint runs/Ant/nn/Ant.pth

# Multi-GPU training (set multi_gpu: True in the config, launch with torchrun)
torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file config.yaml
```
Core Components¶
1. PPO Implementation¶
RL Games' PPO is highly optimized for GPU:
```python
import torch
from rl_games.algos_torch import a2c_continuous

class A2CAgent(a2c_continuous.A2CAgent):
    """
    PPO implementation (called A2C but is actually PPO).

    Key optimizations:
    - Full GPU pipeline (no CPU<->GPU transfers)
    - Mixed precision training
    - Vectorized advantage computation
    - Multi-GPU support via data parallelism
    """

    def __init__(self, base_name, config):
        super().__init__(base_name, config)

        # Network
        self.model = self.network.build(config['network'])

        # Optimizer
        self.optimizer = torch.optim.Adam(
            self.model.parameters(),
            lr=config['learning_rate'],
            eps=1e-5
        )

        # Mixed precision
        if config['mixed_precision']:
            self.scaler = torch.cuda.amp.GradScaler()

        # Multi-GPU
        if config['multi_gpu']:
            self.model = torch.nn.DataParallel(self.model)

    def train_epoch(self):
        """Single PPO epoch"""
        # Collect experience
        batch_dict = self.play_steps()

        # Compute returns and advantages
        self.compute_returns(batch_dict)

        # Mini-batch updates
        for _ in range(self.mini_epochs):
            for batch in self.dataset.mini_batches(self.minibatch_size):
                # Forward pass
                with torch.cuda.amp.autocast(enabled=self.mixed_precision):
                    res_dict = self.model(batch)

                    # Compute losses
                    losses = self.calc_losses(batch, res_dict)
                    total_loss = losses['total_loss']

                # Backward pass
                self.optimizer.zero_grad()
                if self.mixed_precision:
                    self.scaler.scale(total_loss).backward()
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(
                        self.model.parameters(),
                        self.grad_norm
                    )
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    total_loss.backward()
                    torch.nn.utils.clip_grad_norm_(
                        self.model.parameters(),
                        self.grad_norm
                    )
                    self.optimizer.step()

        return losses

    def calc_losses(self, batch, res_dict):
        """Calculate PPO losses"""
        # Unpack rollout batch
        old_log_probs = batch['old_log_probs']
        advantages = batch['advantages']
        returns = batch['returns']
        old_values = batch['old_values']

        # Current policy
        log_probs = res_dict['log_probs']
        values = res_dict['values']
        entropy = res_dict['entropy']

        # PPO clipped loss
        ratio = torch.exp(log_probs - old_log_probs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - self.e_clip, 1.0 + self.e_clip) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        # Value loss (clipped)
        if self.clip_value:
            value_pred_clipped = old_values + torch.clamp(
                values - old_values, -self.e_clip, self.e_clip
            )
            value_losses = (values - returns).pow(2)
            value_losses_clipped = (value_pred_clipped - returns).pow(2)
            value_loss = torch.max(value_losses, value_losses_clipped).mean()
        else:
            value_loss = (returns - values).pow(2).mean()

        # Total loss
        total_loss = (
            policy_loss
            + self.critic_coef * value_loss
            - self.entropy_coef * entropy.mean()
        )

        return {
            'total_loss': total_loss,
            'policy_loss': policy_loss,
            'value_loss': value_loss,
            'entropy': entropy.mean()
        }
```
2. Network Architectures¶
RL Games supports various network architectures:
```python
import torch
import torch.nn as nn
from rl_games.algos_torch import network_builder

class ActorCriticBuilder(network_builder.NetworkBuilder):
    """Build actor-critic networks"""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, name, **kwargs):
        """Build network from config"""
        net = ActorCriticNetwork(
            input_shape=kwargs['input_shape'],
            actions_num=kwargs['actions_num'],
            mlp_units=kwargs['mlp']['units'],
            activation=kwargs['mlp']['activation'],
            separate=kwargs['separate']
        )
        return net

class ActorCriticNetwork(nn.Module):
    """
    Actor-Critic network with optional shared backbone.

    Supports:
    - Separate or shared actor/critic
    - LSTM/GRU for recurrent policies
    - Multi-head outputs (e.g., for multi-task)
    """

    def __init__(
        self,
        input_shape,
        actions_num,
        mlp_units=(256, 128, 64),
        activation='elu',
        separate=False,
        use_rnn=False,
        rnn_units=256
    ):
        super().__init__()
        self.separate = separate
        self.use_rnn = use_rnn

        # Activation
        act = get_activation(activation)

        if separate:
            # Separate actor and critic
            self.actor = build_mlp(input_shape, mlp_units, act)
            self.critic = build_mlp(input_shape, mlp_units, act)
        else:
            # Shared backbone
            self.backbone = build_mlp(input_shape, mlp_units, act)
        actor_input_size = mlp_units[-1]
        critic_input_size = mlp_units[-1]

        # RNN (optional)
        if use_rnn:
            self.rnn = nn.LSTM(
                input_size=mlp_units[-1],
                hidden_size=rnn_units,
                num_layers=1,
                batch_first=True
            )
            actor_input_size = rnn_units
            critic_input_size = rnn_units

        # Actor head (policy)
        self.mu = nn.Linear(actor_input_size, actions_num)
        self.log_std = nn.Parameter(torch.zeros(actions_num))

        # Critic head (value)
        self.value = nn.Linear(critic_input_size, 1)

        # Initialize
        self.apply(init_weights)

    def forward(self, obs, rnn_states=None):
        """Forward pass"""
        if self.separate:
            actor_features = self.actor(obs)
            critic_features = self.critic(obs)
        else:
            features = self.backbone(obs)
            actor_features = features
            critic_features = features

        # RNN
        if self.use_rnn:
            batch_size = obs.shape[0]
            seq_len = obs.shape[1]
            actor_features, rnn_states = self.rnn(
                actor_features.view(batch_size, seq_len, -1),
                rnn_states
            )
            critic_features = actor_features

        # Policy
        mu = self.mu(actor_features)
        std = torch.exp(self.log_std)

        # Sample actions
        dist = torch.distributions.Normal(mu, std)
        actions = dist.sample()
        log_probs = dist.log_prob(actions).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)

        # Value
        values = self.value(critic_features).squeeze(-1)

        return {
            'actions': actions,
            'log_probs': log_probs,
            'entropy': entropy,
            'values': values,
            'rnn_states': rnn_states
        }

def build_mlp(input_size, units, activation):
    """Build MLP"""
    layers = []
    prev_size = input_size
    for size in units:
        layers.append(nn.Linear(prev_size, size))
        layers.append(activation())
        prev_size = size
    return nn.Sequential(*layers)

def get_activation(name):
    """Get activation function"""
    if name == 'elu':
        return nn.ELU
    elif name == 'relu':
        return nn.ReLU
    elif name == 'tanh':
        return nn.Tanh
    else:
        raise ValueError(f"Unknown activation: {name}")

def init_weights(module):
    """Orthogonal initialization"""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=1.0)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)
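A self-contained sketch of the shared-backbone forward pass described above, with hypothetical sizes (8-dim observations, 2-dim actions) rather than the actual RL Games classes:

```python
# Minimal shared-backbone actor-critic forward pass (illustrative sizes).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(8, 32), nn.ELU())
mu_head = nn.Linear(32, 2)        # action mean
value_head = nn.Linear(32, 1)     # state value
log_std = nn.Parameter(torch.zeros(2))  # state-independent log std

obs = torch.randn(4, 8)           # batch of 4 observations
features = backbone(obs)          # one backbone feeds both heads
dist = torch.distributions.Normal(mu_head(features), log_std.exp())
actions = dist.sample()
log_probs = dist.log_prob(actions).sum(dim=-1)  # joint log-prob per sample
values = value_head(features).squeeze(-1)

print(actions.shape, log_probs.shape, values.shape)
# torch.Size([4, 2]) torch.Size([4]) torch.Size([4])
```

The shared backbone halves feature computation versus `separate: True`, at the cost of coupling policy and value gradients.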
3. Custom Environments¶
Register custom environments:
```python
from rl_games.common import env_configurations

def create_my_env(**kwargs):
    """Create custom environment"""
    import gym

    # Your custom environment
    env = gym.make('MyEnv-v0')
    return env

# Register environment; the vecenv_type handles vectorization
env_configurations.register(
    'my_env',
    {
        'vecenv_type': 'RAY',  # or 'SIMPLE'
        'env_creator': lambda **kwargs: create_my_env(**kwargs)
    }
)

# Use in config.yaml:
# config:
#   env_name: my_env
```
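The environment returned by `env_creator` just needs the Gym-style `reset`/`step` interface. A hypothetical minimal environment in plain Python (sketch only, not using the actual `gym` package):

```python
# Hypothetical environment exposing the Gym-style reset/step protocol.
class CountdownEnv:
    """Reward 1.0 per step; episode ends after 10 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # initial observation

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0
        done = self.t >= 10
        return obs, reward, done, {}  # obs, reward, done, info

env = CountdownEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, done, info = env.step(0)
    total += reward
print(total)  # 10.0
```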
Advanced Features¶
Mixed Precision Training¶
Mixed precision runs most of the network in FP16, typically yielding a 1.5-2x speedup and roughly halving memory usage. Enable it with `mixed_precision: True` in the config.
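A standalone sketch of the autocast + gradient-scaler pattern this setting enables (it runs in plain FP32 when no GPU is available; the model and data here are illustrative):

```python
# One mixed-precision training step with PyTorch AMP.
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = (device == 'cuda')  # autocast to FP16 only makes sense on GPU

model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
target = torch.zeros(32, 1, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), target)

optimizer.zero_grad()
scaler.scale(loss).backward()  # scaled backward avoids FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps
scaler.update()                # adapts the loss scale over time
print(torch.isfinite(loss).item())  # True
```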
Multi-GPU Training¶
Enable with `multi_gpu: True` in the config. Scaling is nearly linear with GPU count (roughly 1.8-1.9x with 2 GPUs).
Recurrent Policies (LSTM/GRU)¶
For partially observable tasks:
```yaml
network:
  name: actor_critic
  rnn:
    name: lstm
    units: 256
    layers: 1
    before_mlp: False  # RNN after MLP
  mlp:
    units: [256, 128]
    activation: elu
```
When to use:
- Partial observability (e.g., limited sensor range)
- Tasks requiring memory (e.g., navigation)
- Time-series tasks
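The point of the recurrent core is that output depends on carried hidden state, not just the current observation. A standalone sketch (illustrative sizes, not RL Games code):

```python
# Same input, different hidden state -> different LSTM output.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)

x = torch.ones(1, 1, 4)            # one timestep, identical observation
out1, state = rnn(x)               # fresh (zero) hidden state
out2, state = rnn(x, state)        # carried hidden state from step 1

print(torch.allclose(out1, out2))  # False: memory changes the output
```

This is why `seq_len` matters in the config: BPTT unrolls the policy over short windows of stored hidden states.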
Self-Play¶
For competitive multi-agent tasks:
```yaml
config:
  self_play_config:
    use_selfplay: True
    save_every_steps: 10000
    swap_steps: 8000     # Swap opponent every N steps
    games_to_check: 200  # Games to determine winner
    win_rate: 0.55       # Required win rate to update opponent
```
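A hypothetical sketch of the opponent-update rule these parameters describe (simplified logic, not RL Games' actual implementation): track recent game outcomes and promote the current policy to opponent once its win rate clears the threshold.

```python
# Simplified self-play opponent-update decision.
def should_update_opponent(results, games_to_check=200, win_rate=0.55):
    """results: list of 1 (win) / 0 (loss) outcomes, newest last."""
    if len(results) < games_to_check:
        return False  # not enough evidence yet
    recent = results[-games_to_check:]
    return sum(recent) / games_to_check > win_rate

wins = [1] * 120 + [0] * 80          # 60% win rate over 200 games
print(should_update_opponent(wins))  # True
print(should_update_opponent([1] * 50))  # False: too few games
```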
Asymmetric Actor-Critic¶
Different observations for actor (deployed) vs critic (training):
```yaml
network:
  name: actor_critic_asymmetric
  actor:
    input_shape: [48]   # Deployed observations
    mlp:
      units: [256, 128, 64]
  critic:
    input_shape: [187]  # Privileged observations
    mlp:
      units: [512, 256, 128]
```
Use case: Sim-to-real (critic sees perfect sim state, actor sees real observations)
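A standalone sketch of the asymmetric split: the actor consumes only deployable observations while the critic additionally sees privileged simulator state (sizes mirror the config above but are illustrative, not the RL Games network):

```python
# Asymmetric actor-critic: different observation spaces per network.
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(48, 64), nn.ELU(), nn.Linear(64, 12))
critic = nn.Sequential(nn.Linear(187, 128), nn.ELU(), nn.Linear(128, 1))

actor_obs = torch.randn(4, 48)    # what the robot can measure at deployment
critic_obs = torch.randn(4, 187)  # full privileged sim state (training only)

actions = actor(actor_obs)
values = critic(critic_obs).squeeze(-1)
print(actions.shape, values.shape)  # torch.Size([4, 12]) torch.Size([4])
```

Only the actor is exported for deployment; the critic, and its privileged inputs, are discarded after training.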
Isaac Gym Integration¶
Complete Example¶
"""
train_humanoid.py
Train humanoid with RL Games + Isaac Gym
"""
from rl_games.torch_runner import Runner
from isaacgym import gymapi
import yaml
# Isaac Gym configuration
gym_config = {
"name": "Humanoid",
"physics_engine": gymapi.SIM_PHYSX,
"sim": {
"dt": 0.0166,
"substeps": 2,
"up_axis": "z",
"use_gpu_pipeline": True,
"physx": {
"num_threads": 4,
"solver_type": 1,
"num_position_iterations": 4,
"num_velocity_iterations": 0,
"contact_offset": 0.002,
"rest_offset": 0.0,
"bounce_threshold_velocity": 0.2,
"max_depenetration_velocity": 10.0,
"default_buffer_size_multiplier": 2.0
}
},
"task": {
"randomize": True,
"randomization_params": {
"frequency": 600,
"observations": {
"range": [0, 0.002],
"operation": "additive"
},
"actions": {
"range": [0.0, 0.02],
"operation": "additive"
},
"sim_params": {
"gravity": {
"range": [0, 0.4],
"operation": "additive"
}
},
"actor_params": {
"humanoid": {
"color": True,
"dof_properties": {
"damping": {
"range": [0.5, 1.5],
"operation": "scaling"
},
"stiffness": {
"range": [0.5, 1.5],
"operation": "scaling"
}
},
"rigid_body_properties": {
"mass": {
"range": [0.5, 1.5],
"operation": "scaling"
}
}
}
}
}
},
"env": {
"numEnvs": 4096,
"envSpacing": 5,
"enableDebugVis": False
}
}
# RL Games configuration
rl_config = {
"params": {
"algo": {
"name": "a2c_continuous"
},
"model": {
"name": "continuous_a2c_logstd"
},
"network": {
"name": "actor_critic",
"separate": False,
"space": {
"continuous": {
"mu_activation": "None",
"sigma_activation": "None",
"mu_init": {"name": "default"},
"sigma_init": {"name": "const_initializer", "val": 0},
"fixed_sigma": True
}
},
"mlp": {
"units": [400, 200, 100],
"activation": "elu",
"initializer": {"name": "default"}
}
},
"config": {
"name": "Humanoid",
"env_name": "isaac",
"multi_gpu": False,
"ppo": True,
"mixed_precision": True,
"normalize_input": True,
"normalize_value": True,
"reward_shaper": {"scale_value": 0.01},
"normalize_advantage": True,
"gamma": 0.99,
"tau": 0.95,
"learning_rate": 2e-4,
"lr_schedule": "adaptive",
"kl_threshold": 0.008,
"score_to_win": 10000,
"max_epochs": 10000,
"save_best_after": 100,
"save_frequency": 200,
"grad_norm": 1.0,
"entropy_coef": 0.0,
"e_clip": 0.2,
"horizon_length": 32,
"minibatch_size": 32768,
"mini_epochs": 5,
"critic_coef": 4,
"clip_value": True
}
}
}
def train():
# Create runner
runner = Runner()
runner.load(rl_config)
# Train
runner.run({
'train': True,
'play': False,
'checkpoint': ''
})
if __name__ == "__main__":
train()
Running on Isaac Gym¶
```bash
# Train
python train_humanoid.py

# Pin to a specific GPU
CUDA_VISIBLE_DEVICES=0 python train_humanoid.py
```
The minimal script above hard-codes its run arguments; to support flags such as `--play --checkpoint runs/Humanoid/nn/Humanoid.pth`, or multi-GPU launches, add argument parsing (as in RL Games' own `runner.py`) and feed the parsed values into `runner.run`.
Performance Optimization¶
Benchmarks¶
Training speed on Humanoid task (4096 environments, RTX 3090):
| Configuration | FPS | Time to 10M steps |
|---|---|---|
| RL Games (FP16, GPU pipeline) | 145,000 | 68 seconds |
| RL Games (FP32, GPU pipeline) | 89,000 | 112 seconds |
| RSL-RL | 72,000 | 139 seconds |
| Stable-Baselines3 (CPU) | 3,500 | 47 minutes |
Speedup: roughly 40x faster than CPU-based Stable-Baselines3 in this benchmark.
Optimization Tips¶
```yaml
# Maximum speed configuration
config:
  # GPU pipeline (critical!)
  use_gpu_pipeline: True

  # Mixed precision (~2x speedup)
  mixed_precision: True

  # Large batches for GPU efficiency
  minibatch_size: 32768
  horizon_length: 32

  # Normalize for stability
  normalize_input: True
  normalize_value: True
  normalize_advantage: True

  # Adaptive LR for speed
  lr_schedule: adaptive

  # Reduce logging overhead
  print_stats: False  # Only when debugging

  # Efficient value loss
  clip_value: True
```
Tips & Best Practices¶
Hyperparameter Tuning¶
Start with defaults:
```yaml
config:
  gamma: 0.99
  tau: 0.95
  learning_rate: 3e-4
  e_clip: 0.2
  mini_epochs: 5
  horizon_length: 16
  minibatch_size: 32768
```
For hard exploration:
```yaml
config:
  entropy_coef: 0.01   # Increase exploration
  learning_rate: 1e-4  # Slower learning
  tau: 0.9             # Lower GAE lambda
```
For sample efficiency:
```yaml
config:
  horizon_length: 64     # Longer rollouts
  mini_epochs: 10        # More SGD steps
  minibatch_size: 16384  # Smaller batches
```
Debugging¶
```yaml
# Enable verbose logging
config:
  print_stats: True

# Check gradients
config:
  grad_norm: 0.5  # Lower if gradients are exploding

# Visualize training
config:
  enable_tensorboard: True
```
```python
# Profile performance
import cProfile
cProfile.run('runner.run(args)', 'stats.prof')

# Analyze profile
import pstats
p = pstats.Stats('stats.prof')
p.sort_stats('cumulative').print_stats(20)
```
Common Issues¶
Problem: NaN losses — lower the learning rate, keep gradient clipping on (`truncate_grads: True`, `grad_norm`), and scale rewards down via `reward_shaper.scale_value`; check the environment for invalid observations.
Problem: Slow convergence — use the adaptive LR schedule with a KL threshold, increase `horizon_length`, and verify `normalize_input`/`normalize_value` are enabled.
Problem: OOM (Out of Memory) — reduce `minibatch_size`, `horizon_length`, or the number of environments; enabling `mixed_precision` also roughly halves activation memory.
References¶
Official Resources¶
- RL Games GitHub: https://github.com/Denys88/rl_games
- Documentation: https://github.com/Denys88/rl_games/tree/master/docs
- Isaac Gym: https://developer.nvidia.com/isaac-gym
Papers¶
- Isaac Gym: Makoviychuk et al., "Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning", NeurIPS 2021 (Datasets and Benchmarks)
- GPU RL: Makoviychuk & Makoviichuk, "RL Games: High Performance RL Library", 2021
Benchmarks¶
- Isaac Gym Benchmark: https://github.com/NVIDIA-Omniverse/IsaacGymEnvs
- Comprehensive benchmarks for all algorithms
- Pre-configured tasks
Next Steps¶
- RSL-RL - Alternative Isaac Lab RL library
- Stable-Baselines3 - More general-purpose RL
- Isaac Gym Envs - Benchmark environments