Coherent Soft Imitation Learning

Joe Watson, Sandy H. Huang, Nicolas Heess

Advances in Neural Information Processing Systems (NeurIPS), 2023 [Spotlight]



In imitation learning, can we learn a reward for which the behaviorally cloned policy is optimal?

In entropy-regularized RL, the cloned policy itself defines this reward. We call this property coherency.


  1. Coherency in entropy-regularized reinforcement learning
  2. Stationary policies in the tabular and continuous settings
  3. Continuous control from agent demonstrations
  4. Continuous control from human demonstrations from states
  5. Continuous control from human demonstrations from images

Coherency in entropy-regularized reinforcement learning

Ng et al. (1999) showed that a potential function \(\Psi\) can shape a reward function without changing the optimal policy,

$$ \underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\tilde{r}(\mathbf{s},\mathbf{a})}_{\text{Shaped reward}} + \underbrace{\Psi(\mathbf{s}) - \gamma\,\mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}[\Psi(\mathbf{s}')]}_{\text{Potential function shaping}}, $$

$$ \underbrace{\mathcal{Q}(\mathbf{s},\mathbf{a})}_{\text{True critic}} = \underbrace{\tilde{\mathcal{Q}}(\mathbf{s},\mathbf{a})}_{\text{Shaped critic}} + \underbrace{\Psi(\mathbf{s})}_{\text{Potential}}. $$

In entropy-regularized RL, the optimal policy takes the form of a pseudo-posterior,

$$ \underbrace{q_\alpha(\mathbf{a}\mid\mathbf{s})}_{\text{Cloned policy}} = \underbrace{\exp\left(\frac{1}{\alpha}\,(\mathcal{Q}(\mathbf{s},\mathbf{a}) - \mathcal{V}_\alpha(\mathbf{s}))\right)}_{\text{Unknown advantage function}}\, \underbrace{p(\mathbf{a}\mid\mathbf{s})}_{\text{Policy prior}}. $$

Rearranging terms, we can express the critic in terms of the log policy ratio and the soft value function,

$$ \underbrace{\mathcal{Q}(\mathbf{s},\mathbf{a})}_{\text{Unknown critic}} = \underbrace{\alpha\log\frac{q_\alpha(\mathbf{a}\mid\mathbf{s})}{p(\mathbf{a}\mid\mathbf{s})}}_{\text{Log policy ratio}} + \underbrace{\mathcal{V}_\alpha(\mathbf{s})}_{\text{Soft value function}}. $$

Comparing this result with that of Ng et al., we show that the log policy ratio is a shaped reward function,

$$ \text{Shaping theory:} \quad \underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\tilde{r}(\mathbf{s},\mathbf{a})}_{\text{Shaped reward}} + \underbrace{\Psi(\mathbf{s}) - \gamma\,\mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}[\Psi(\mathbf{s}')]}_{\text{Potential function shaping}}, $$

$$ \text{Coherent reward:} \quad \underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\alpha\log \frac{q_\alpha(\mathbf{a}\mid\mathbf{s})}{p(\mathbf{a}\mid\mathbf{s})}}_{\text{Coherent reward}} + \underbrace{\mathcal{V}_\alpha(\mathbf{s}) - \gamma\, \mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}\left[\mathcal{V}_\alpha(\mathbf{s}')\right]}_{\text{Soft value function shaping}}. $$

To leverage coherency, behavioral cloning amounts to KL-regularized heteroscedastic regression,

$$ \textstyle\max_\mathbf{\theta} \mathbb{E}_{\mathbf{a},\mathbf{s}\sim \mathcal{D}}[\log q_\mathbf{\theta}(\mathbf{a}\mid\mathbf{s})] {\,-\,} \lambda\, \mathbb{D}_{\text{KL}}[q_\mathbf{\theta}(\mathbf{w})\mid\mid p(\mathbf{w})], $$

$$ \text{ where } q_\mathbf{\theta}(\mathbf{a}\mid\mathbf{s}) = \int p(\mathbf{a}\mid\mathbf{s},\mathbf{w}) \, q_\mathbf{\theta}(\mathbf{w}) \,\mathrm{d}\mathbf{w}, \quad p(\mathbf{a}\mid\mathbf{s}) = \int p(\mathbf{a}\mid\mathbf{s},\mathbf{w}) \, p(\mathbf{w}) \,\mathrm{d}\mathbf{w}. $$
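
For concreteness, here is a minimal numpy sketch of this KL-regularized cloning objective, assuming diagonal-Gaussian weight distributions \(q_\theta(\mathbf{w})\) and \(p(\mathbf{w})\) and a Monte-Carlo estimate of the marginal likelihood; the function and variable names are illustrative, not the released implementation.

```python
import numpy as np

def log_mean_exp(x, axis):
    """Numerically stable log of the mean of exp(x) along an axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.mean(np.exp(x - m), axis=axis))

def diag_gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL[ N(mean_q, std_q^2) || N(mean_p, std_p^2) ] for diagonal Gaussians."""
    var_q, var_p = std_q ** 2, std_p ** 2
    return np.sum(
        np.log(std_p / std_q) + (var_q + (mean_q - mean_p) ** 2) / (2.0 * var_p) - 0.5
    )

def bc_objective(per_sample_weight_log_liks,
                 w_mean_q, w_std_q, w_mean_p, w_std_p, lam=1e-3):
    """Monte-Carlo estimate of E_D[log q_theta(a|s)] - lam * KL[q_theta(w) || p(w)].

    `per_sample_weight_log_liks` has shape (num_demos, num_weight_samples) and
    holds log p(a_i | s_i, w_j) for weight samples w_j ~ q_theta(w), so that
    log q_theta(a_i | s_i) is approximated by a log-mean-exp over the samples.
    """
    log_q = log_mean_exp(per_sample_weight_log_liks, axis=1)
    return np.mean(log_q) - lam * diag_gaussian_kl(w_mean_q, w_std_q, w_mean_p, w_std_p)
```

The weight-space KL term keeps the cloned distribution close to the prior, which keeps the induced log policy ratio well behaved away from the demonstrations.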

Using this regularized regression objective, the cloned policy defines the shaped reward through the log policy ratio, in contrast to prior work that learns the reward using classification.
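
Given the cloned policy and the prior, the coherent reward is just the scaled log policy ratio. A minimal sketch, assuming diagonal-Gaussian policies whose mean and standard deviation have already been evaluated at the state (names are illustrative):

```python
import numpy as np

def gaussian_log_prob(a, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return np.sum(
        -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((a - mean) / std) ** 2,
        axis=-1,
    )

def coherent_reward(a, cloned_mean, cloned_std, prior_mean, prior_std, alpha=1.0):
    """Coherent reward r(s, a) = alpha * log[ q_alpha(a|s) / p(a|s) ]."""
    log_q = gaussian_log_prob(a, cloned_mean, cloned_std)
    log_p = gaussian_log_prob(a, prior_mean, prior_std)
    return alpha * (log_q - log_p)
```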

Stationary policies in the tabular and continuous settings

Stationary policies are straightforward to implement in the tabular setting (left). The equivalent policy in the continuous setting (center) is harder to implement. We adopt a special neural network architecture that approximates the desired stationary behaviour (right).
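
As a rough illustration of the continuous-setting idea, the sketch below blends a learned Gaussian head with the policy prior via radial-basis features that vanish away from the demonstration states, so the policy (and hence the coherent reward) reverts to the prior off-distribution. This gating construction is an assumption for illustration only, not the architecture used in the paper.

```python
import numpy as np

class StationaryGaussianPolicy:
    """Illustrative stationary policy head (not the paper's architecture).

    Radial-basis features vanish far from the demonstration states, so the
    predicted mean and std smoothly revert to the prior there and the
    log-policy-ratio reward goes to zero outside the data support.
    """

    def __init__(self, demo_states, prior_mean, prior_std, action_dim, bandwidth=1.0):
        self.centers = demo_states                  # (K, state_dim) RBF centers
        self.bandwidth = bandwidth
        self.prior_mean = prior_mean                # (action_dim,)
        self.prior_std = prior_std                  # (action_dim,)
        k = demo_states.shape[0]
        self.w_mean = np.zeros((k, action_dim))     # learned mean offsets
        self.w_logstd = np.zeros((k, action_dim))   # learned log-std offsets

    def features(self, s):
        dists = np.sum((self.centers - s) ** 2, axis=-1)
        return np.exp(-dists / (2.0 * self.bandwidth ** 2))  # (K,)

    def __call__(self, s):
        phi = self.features(s)                      # ~0 far from the demos
        mean = self.prior_mean + phi @ self.w_mean
        std = self.prior_std * np.exp(phi @ self.w_logstd)
        return mean, std
```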

Continuous control from agent demonstrations

door-v0 [online, states] (csil)
door-v0 [online, states] (iqlearn)
hammer-v0 [online, states] (csil)
hammer-v0 [online, states] (iqlearn)
HalfCheetah-v2 [offline, states] (csil)
HalfCheetah-v2 [offline, states] (iqlearn)

Continuous control from human demonstrations from states

Lift [online, states] (csil)
Lift [online, states] (iqlearn)
PickPlaceCan [online, states] (csil)
PickPlaceCan [online, states] (iqlearn)
NutAssemblySquare [online, states] (csil)
NutAssemblySquare [online, states] (iqlearn)

Continuous control from human demonstrations from images

Lift [online, images] (csil)
PickPlaceCan [online, images] (csil)