Coherency in entropy-regularized reinforcement learning
Ng et al. (1999) showed that a potential function \(\Psi\) can shape a reward function without changing the optimal policy,
$$
\underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\tilde{r}(\mathbf{s},\mathbf{a})}_{\text{Shaped reward}} + \underbrace{\Psi(\mathbf{s}) - \gamma\,\mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}[\Psi(\mathbf{s}')]}_{\text{Potential function shaping}},
$$
$$
\underbrace{\mathcal{Q}(\mathbf{s},\mathbf{a})}_{\text{True critic}} = \underbrace{\tilde{\mathcal{Q}}(\mathbf{s},\mathbf{a})}_{\text{Shaped critic}} + \underbrace{\Psi(\mathbf{s})}_{\text{Potential}}.
$$
In entropy-regularized RL, the optimal policy takes the form of a pseudo-posterior,
$$
\underbrace{q_\alpha(\mathbf{a}\mid\mathbf{s})}_{\text{Cloned policy}} = \underbrace{\exp\left(\frac{1}{\alpha}\,(\mathcal{Q}(\mathbf{s},\mathbf{a}) - \mathcal{V}_\alpha(\mathbf{s}))\right)}_{\text{Unknown advantage function}}\, \underbrace{p(\mathbf{a}\mid\mathbf{s})}_{\text{Policy prior}}.
$$
Rearranging terms, we can express the critic in terms of the log policy ratio and the soft value function,
$$
\underbrace{\mathcal{Q}(\mathbf{s},\mathbf{a})}_{\text{Unknown critic}} = \underbrace{\alpha\log\frac{q_\alpha(\mathbf{a}\mid\mathbf{s})}{p(\mathbf{a}\mid\mathbf{s})}}_{\text{Log policy ratio}} + \underbrace{\mathcal{V}_\alpha(\mathbf{s})}_{\text{Soft value function}}.
$$
Comparing this result with that of Ng et al., we show that the log policy ratio is a shaped reward function,
$$
\text{Shaping theory:} \quad \underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\tilde{r}(\mathbf{s},\mathbf{a})}_{\text{Shaped reward}} + \underbrace{\Psi(\mathbf{s}) - \gamma\,\mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}[\Psi(\mathbf{s}')]}_{\text{Potential function shaping}},
$$
$$
\text{Coherent reward:} \quad \underbrace{r(\mathbf{s},\mathbf{a})}_{\text{True reward}} = \underbrace{\alpha\log \frac{q_\alpha(\mathbf{a}\mid\mathbf{s})}{p(\mathbf{a}\mid\mathbf{s})}}_{\text{Coherent reward}} + \underbrace{\mathcal{V}_\alpha(\mathbf{s}) - \gamma\, \mathbb{E}_{\mathbf{s}'\sim \mathcal{P}(\cdot\mid\mathbf{s},\mathbf{a})}\left[\mathcal{V}_\alpha(\mathbf{s}')\right]}_{\text{Soft value function shaping}}.
$$
Matching terms, the soft value function \(\mathcal{V}_\alpha\) plays the role of the potential \(\Psi\), and the scaled log policy ratio plays the role of the shaped reward \(\tilde{r}\). To leverage coherency, behavioral cloning amounts to KL-regularized heteroscedastic regression,
$$
\textstyle\max_\mathbf{\theta} \mathbb{E}_{\mathbf{a},\mathbf{s}\sim \mathcal{D}}[\log q_\mathbf{\theta}(\mathbf{a}\mid\mathbf{s})] {\,-\,} \lambda\, \mathbb{D}_{\text{KL}}[q_\mathbf{\theta}(\mathbf{w})\mid\mid p(\mathbf{w})],
$$
$$
\text{where } q_\mathbf{\theta}(\mathbf{a}\mid\mathbf{s}) = \int p(\mathbf{a}\mid\mathbf{s},\mathbf{w}) \, q_\mathbf{\theta}(\mathbf{w}) \,\mathrm{d}\mathbf{w}, \quad p(\mathbf{a}\mid\mathbf{s}) = \int p(\mathbf{a}\mid\mathbf{s},\mathbf{w}) \, p(\mathbf{w}) \,\mathrm{d}\mathbf{w}.
$$
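As a minimal sketch (not the authors' implementation), this objective can be estimated with a mean-field Gaussian posterior \(q_\mathbf{\theta}(\mathbf{w})\) over the weights of a heteroscedastic Gaussian policy, a single reparameterized weight sample for the log-likelihood term, and an analytic KL penalty. The layer sizes, prior scale `prior_std`, and weight `lam` below are illustrative assumptions.

```python
# Sketch of KL-regularized heteroscedastic behavioral cloning with a
# factorised Gaussian posterior over policy weights (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence


class BayesianLinear(nn.Module):
    """Linear layer with a factorised Gaussian posterior over its weights."""

    def __init__(self, d_in, d_out, prior_std=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(0.01 * torch.randn(d_out, d_in))
        self.w_rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # softplus -> std
        self.b = nn.Parameter(torch.zeros(d_out))
        self.prior_std = prior_std

    def forward(self, x):
        w_std = F.softplus(self.w_rho)
        w = self.w_mu + w_std * torch.randn_like(w_std)  # reparameterised sample w ~ q(w)
        prior = Normal(torch.zeros_like(w_std), self.prior_std * torch.ones_like(w_std))
        self._kl = kl_divergence(Normal(self.w_mu, w_std), prior).sum()
        return F.linear(x, w, self.b)


class CloningPolicy(nn.Module):
    """Heteroscedastic Gaussian policy q_theta(a|s) with Bayesian weights."""

    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.l1 = BayesianLinear(s_dim, hidden)
        self.l2 = BayesianLinear(hidden, 2 * a_dim)  # mean and log-std per action dim

    def forward(self, s):
        h = torch.relu(self.l1(s))
        mu, log_std = self.l2(h).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5.0, 2.0).exp())

    def kl(self):
        # Valid after a forward pass, which caches the per-layer KL terms.
        return self.l1._kl + self.l2._kl


def bc_loss(policy, s, a, lam=1e-3):
    """Negative of: E_D[log q_theta(a|s)] - lam * KL[q_theta(w) || p(w)]."""
    dist = policy(s)  # one Monte Carlo sample of the weights
    log_lik = dist.log_prob(a).sum(-1).mean()
    return -(log_lik - lam * policy.kl())
```

A single weight sample per batch keeps the estimator cheap; averaging several samples would reduce the variance of the marginal-likelihood term at extra cost.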
Using this regularized regression objective, the cloned policy defines the shaped (coherent) reward via the log policy ratio, in contrast to prior work that learns the reward by classification.
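Concretely, once the policy is cloned, the coherent reward \(\alpha\log\frac{q_\alpha(\mathbf{a}\mid\mathbf{s})}{p(\mathbf{a}\mid\mathbf{s})}\) can be read off from the learned and prior policies. The sketch below assumes `policy` and `prior` both return a `torch.distributions.Normal` over actions (as in the previous snippet), uses a fixed prior policy network as a stand-in for the marginal \(p(\mathbf{a}\mid\mathbf{s})\), and treats the temperature `alpha` as an illustrative hyperparameter.

```python
# Sketch: recover the coherent (shaped) reward from the cloned policy and the
# policy prior via the scaled log ratio alpha * log(q(a|s) / p(a|s)).
import torch


@torch.no_grad()
def coherent_reward(policy, prior, s, a, alpha=0.1):
    """r_tilde(s, a) = alpha * (log q_alpha(a|s) - log p(a|s))."""
    log_q = policy(s).log_prob(a).sum(-1)  # log-density under the cloned policy
    log_p = prior(s).log_prob(a).sum(-1)   # log-density under the policy prior
    return alpha * (log_q - log_p)
```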