Introduction to Imitation Learning

What is Imitation Learning?

Imitation Learning (IL) is a paradigm where agents learn to perform tasks by observing and mimicking expert behavior.

Motivation

The Reward Specification Problem

Defining good reward functions for RL is hard:

# Too sparse
reward = 1 if task_complete else 0

# Too dense (might encourage shortcuts)
reward = -distance_to_goal + action_smoothness - collision_penalty + ...

IL sidesteps this by learning directly from demonstrations.

Formal Framework

Problem Setup

Given:

  • Expert policy \(\pi^*\) (unknown)
  • Demonstrations \(\mathcal{D} = \{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}\) sampled from \(\pi^*\)

Goal: learn a policy \(\hat{\pi}\) that mimics \(\pi^*\).

Objective

Minimize expected difference between learned and expert policy:

\[ \min_{\pi} \mathbb{E}_{s \sim d^{\pi^*}} [c(s, \pi(s), \pi^*(s))] \]

where \(d^{\pi^*}\) is the state distribution induced by the expert policy, and \(c\) is a cost function measuring disagreement between the learned and expert actions.
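In practice this expectation is approximated by an empirical average over the demonstration set. A minimal sketch (the squared-error cost, the toy data, and the function names are illustrative assumptions, not part of any library):

```python
import numpy as np

def empirical_imitation_cost(states, expert_actions, policy, cost):
    """Monte-Carlo estimate of E_{s ~ d^{pi*}}[c(s, pi(s), pi*(s))],
    using the expert-visited states as samples."""
    return np.mean([cost(s, policy(s), a) for s, a in zip(states, expert_actions)])

# One possible choice of c: squared error between continuous actions
squared_cost = lambda s, a, a_star: float((a - a_star) ** 2)

states = np.array([0.0, 1.0, 2.0])
expert_actions = np.array([0.0, 1.0, 2.0])
identity_policy = lambda s: s  # perfectly mimics this toy expert

print(empirical_imitation_cost(states, expert_actions, identity_policy, squared_cost))  # 0.0
```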

The Distribution Shift Problem

Compounding Errors

Behavioral cloning suffers from distribution shift:

  1. Policy trained on expert state distribution
  2. Deployment: policy makes small mistakes
  3. Reaches states never seen in training
  4. Makes larger mistakes (no training data)
  5. Errors compound over time

graph TD
    A[Expert States] --> B[Train Policy]
    B --> C[Deploy Policy]
    C --> D[Small Error]
    D --> E[Novel State]
    E --> F[Large Error]
    F --> G[Catastrophic Failure]
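The cascade above can be simulated with a toy error model (all probabilities here are made-up illustrative numbers): the policy errs rarely while on the expert's state distribution, but far more often once a first slip pushes it into novel states.

```python
import random

def rollout_errors(T, eps_in=0.05, eps_out=0.5, rng=random):
    """One episode: while still on the expert's state distribution the policy
    errs with probability eps_in; after the first slip it visits novel states
    and errs with the larger probability eps_out. Returns the error count."""
    off_distribution = False
    errors = 0
    for _ in range(T):
        p = eps_out if off_distribution else eps_in
        if rng.random() < p:
            errors += 1
            off_distribution = True  # one mistake pushes us off-distribution
    return errors

rng = random.Random(0)
avg_short = sum(rollout_errors(10, rng=rng) for _ in range(2000)) / 2000
avg_long = sum(rollout_errors(100, rng=rng) for _ in range(2000)) / 2000
print(avg_short, avg_long)  # errors grow faster than linearly in the horizon
```

Multiplying the horizon by 10 multiplies the average error count by far more than 10, which is exactly the superlinear growth the bound below formalizes.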

Mathematical Analysis

Error bound for behavioral cloning:

\[ J(\pi^*) - J(\hat{\pi}) \leq \epsilon T^2 \]

where:

  • \(\epsilon\): per-step imitation error
  • \(T\): episode length (horizon)

The performance gap is quadratic in the horizon!

IL Algorithm Families

1. Behavioral Cloning

Approach: Supervised learning

Algorithm:

Input: Expert demonstrations D = {(s, a)}
Output: Policy π

1. Collect dataset D from expert
2. Train policy π(a|s) to maximize likelihood:
   max_π Σ log π(a|s) for (s,a) in D

When to use:

  • Demonstrations are on-policy (cover the deployment distribution)
  • Short-horizon tasks
  • Expert is consistent
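The maximum-likelihood step above can be sketched end to end on a toy problem. Everything here is an illustrative assumption: a tabular softmax policy over small integer states and actions, and a hypothetical deterministic expert with a* = s mod 3, trained by plain gradient ascent on the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
# Hypothetical expert: deterministic mapping a* = s mod n_actions
demos = [(int(s), int(s) % n_actions) for s in rng.integers(0, n_states, size=500)]

logits = np.zeros((n_states, n_actions))
for _ in range(200):                     # gradient ascent on sum log pi(a|s)
    grad = np.zeros_like(logits)
    for s, a in demos:
        p = np.exp(logits[s] - logits[s].max())
        p /= p.sum()                     # softmax pi(.|s)
        grad[s] -= p                     # gradient of the log-partition term
        grad[s, a] += 1.0                # gradient of the chosen action's logit
    logits += 0.5 * grad / len(demos)

policy = logits.argmax(axis=1)           # greedy learned policy
print(policy)  # [0 1 2 0] -- recovers the expert mapping
```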

2. Interactive IL (DAgger)

Approach: Query expert during training

Algorithm:

Initialize: Dataset D = D_expert
for iteration in 1..N:
    1. Train policy π on D
    2. Execute π, collect states S
    3. Query expert for actions A on states S
    4. Add (S, A) to D
    5. Retrain

When to use:

  • Expert is available during training
  • Need to handle distribution shift
  • Can afford interactive learning
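The loop above can be sketched on a toy chain environment. The setup is entirely assumed for illustration: states 0..9, an expert that always moves right toward a goal at state 9, a learner that is just a lookup table defaulting to the wrong action in states it has never been labeled on, and "retraining" reduced to refreshing that table.

```python
def expert(s):
    return 1 if s < 9 else 0             # 1 = right, 0 = left

def rollout(policy, T=10):
    s, visited = 0, []
    for _ in range(T):
        visited.append(s)
        s = min(9, max(0, s + (1 if policy(s) == 1 else -1)))
    return visited

dataset = {}                             # aggregated state -> expert action
policy = lambda s: dataset.get(s, 0)     # untrained states default to 'left'

for _ in range(10):                      # DAgger iterations
    states = rollout(policy)             # 1-2. train (lookup) and execute
    for s in states:                     # 3. query expert on visited states
        dataset[s] = expert(s)           # 4-5. aggregate and 'retrain'

print(rollout(policy))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note how each iteration extends the labeled region by one state: the policy's own mistakes determine which new states get expert labels, which is precisely how DAgger fixes distribution shift.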

3. Inverse Reinforcement Learning

Approach: Learn reward, then optimize it

Algorithm:

1. Infer reward function R from demonstrations
2. Use RL to find policy that maximizes R

When to use:

  • Want an interpretable reward
  • Need generalization to new scenarios
  • Have compute for the RL step
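A minimal sketch of the two-step recipe, in the style of feature-expectation matching (the 5-state chain, the linear one-hot reward, and the update rule are all illustrative assumptions; the inner RL step is exact value iteration, affordable only because the MDP is tiny):

```python
import numpy as np

n = 5  # chain states 0..4; actions: 0 = left, 1 = right

def value_iteration(w, gamma=0.9, iters=100):
    """Step 2: RL on the current linear reward R(s) = w[s]."""
    V = np.zeros(n)
    for _ in range(iters):
        Q = np.stack([w + gamma * V[np.clip(np.arange(n) - 1, 0, n - 1)],
                      w + gamma * V[np.clip(np.arange(n) + 1, 0, n - 1)]])
        V = Q.max(axis=0)
    return Q.argmax(axis=0)              # greedy policy per state

def feature_expectations(policy, T=20):
    """Empirical state-visitation features of a policy from state 0."""
    s, mu = 0, np.zeros(n)
    for _ in range(T):
        mu[s] += 1
        s = int(np.clip(s + (1 if policy[s] == 1 else -1), 0, n - 1))
    return mu / T

expert_mu = feature_expectations(np.ones(n, dtype=int))  # expert runs right
w = np.zeros(n)
for _ in range(20):                      # Step 1: infer the reward
    mu = feature_expectations(value_iteration(w))
    w += 0.1 * (expert_mu - mu)          # push reward toward expert features

print(value_iteration(w))  # [1 1 1 1 1] -- recovered policy matches the expert
```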

Theoretical Foundations

Performance Bounds

DAgger Theorem:

With DAgger, error is linear in horizon:

\[ J(\pi^*) - J(\hat{\pi}) \leq \epsilon T \]

Much better than BC's quadratic bound!
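Plugging concrete numbers into the two bounds makes the difference vivid (the per-step error value is an arbitrary illustration):

```python
def bc_gap(eps, T):
    return eps * T ** 2   # behavioral cloning: quadratic in horizon

def dagger_gap(eps, T):
    return eps * T        # DAgger: linear in horizon

for T in (10, 100, 1000):
    print(T, bc_gap(0.01, T), dagger_gap(0.01, T))
```

At a horizon of 1000 steps the BC bound is 1000 times looser than the DAgger bound, even though both share the same per-step error.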

Sample Complexity

Number of demonstrations needed scales with:

  • State space dimensionality
  • Action space complexity
  • Task horizon
  • Desired performance level

Practical Considerations

Multi-Modal Actions

Demonstrations may show multiple valid strategies:

# Example: Two valid grasps for same object
demonstration_1: approach_from_top()
demonstration_2: approach_from_side()

# BC will average these actions (bad!)
learned_action: approach_from_middle() # Fails!

Solutions:

  • Mixture of Experts
  • Action discretization
  • Diffusion policies
  • Energy-based models
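The averaging failure is easy to reproduce numerically (the grasp encoding is an illustrative assumption: top grasp as action +1, side grasp as -1):

```python
import numpy as np

# Bimodal expert data: half the demos grasp from the top (+1),
# half from the side (-1), for the same state.
actions = np.array([+1.0, -1.0] * 50)

# A mean-squared-error regressor converges to the conditional mean.
mse_prediction = actions.mean()
print(mse_prediction)                          # 0.0 -> 'approach from the middle'
print(np.abs(mse_prediction - actions).min())  # distance 1.0 from every demo
```

The prediction sits exactly between the two modes and matches neither demonstrated strategy, which is why multi-modal policy classes from the list above are needed.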

Suboptimal Demonstrations

Real demonstrations are rarely perfect:

# Handle suboptimal data by down-weighting low-quality demonstrations
def train_with_confidence_weights(demonstrations):
    total_loss = 0.0
    for demo in demonstrations:
        # Weight each demonstration by its estimated quality
        weight = estimate_optimality(demo)
        total_loss += weight * bc_loss(demo)
    return total_loss

Modern Extensions

Generative Behavioral Cloning

Use generative models for multi-modal distributions:

  • VAE: Learn latent space of actions
  • Diffusion: Iteratively denoise to action
  • Flow Matching: Transform noise to action

One-Shot Imitation

Learn from single demonstration:

  • Meta-learning approaches
  • Context-conditioned policies
  • Vision-language models

Learning from Videos

Learn without action labels:

  • Video prediction models
  • Inverse models
  • Time-contrastive learning

Next Steps