Coherent Soft Imitation Learning
Joe Watson, Sandy H. Huang, Nicolas Heess
Advances in Neural Information Processing Systems (NeurIPS), 2023 [Spotlight]
In entropy-regularized RL, the policy itself defines the imitation reward. We call this property coherency.
Coherency in entropy-regularized reinforcement learning
Ng et al. (1999) showed that a potential function can shape a reward function without changing the optimal policy.
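For reference, a standard statement of this result: for any potential function $\Phi$ and discount $\gamma$ (notation here is the conventional one, not necessarily the paper's), the shaped reward

$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)$$

induces the same optimal policy as $r$.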
In entropy-regularized RL, the optimal policy takes the form of a pseudo-posterior.
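A sketch in standard soft-RL notation (temperature $\alpha$, prior policy $q$, soft critic $Q^*$, soft value $V^*$; these symbols are assumed here rather than taken from the paper):

$$\pi^*(a \mid s) = q(a \mid s)\,\exp\!\big((Q^*(s,a) - V^*(s))/\alpha\big), \qquad V^*(s) = \alpha \log \int q(a \mid s)\,\exp\!\big(Q^*(s,a)/\alpha\big)\,\mathrm{d}a.$$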
Rearranging terms, we can express the critic in terms of the log policy ratio and the soft value function.
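In the same notation, the rearrangement reads

$$Q^*(s,a) = \alpha \log \frac{\pi^*(a \mid s)}{q(a \mid s)} + V^*(s).$$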
Comparing this result with that of Ng et al., we show that the log policy ratio is a shaped reward function.
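Under these assumptions, combining the expression above with the soft Bellman equation $Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\!\left[V^*(s')\right]$ gives

$$\alpha \log \frac{\pi^*(a \mid s)}{q(a \mid s)} = r(s,a) + \gamma\,\mathbb{E}_{s'}\!\left[V^*(s')\right] - V^*(s),$$

i.e. the original reward shaped by the potential $\Phi = V^*$.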
To leverage coherency, we perform behavioral cloning as KL-regularized heteroscedastic regression.
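A sketch of such an objective for a heteroscedastic Gaussian policy $\pi_\theta$ (state-dependent mean and variance), cloned on demonstrations $\mathcal{D}$ and regularized toward the prior $q$ (the weighting $\lambda$ and symbols are illustrative rather than the paper's exact formulation):

$$\max_{\theta}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\log \pi_\theta(a \mid s)\right] \;-\; \lambda\,\mathbb{E}_{s\sim\mathcal{D}}\!\left[\mathrm{KL}\!\left(\pi_\theta(\cdot \mid s)\,\|\,q(\cdot \mid s)\right)\right].$$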
With this regularized regression objective, the cloned policy defines the shaped reward via the log policy ratio, in contrast to prior work that learns the reward through classification.
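As an illustration, here is a minimal PyTorch-style sketch of evaluating such a log-ratio reward; the network interfaces and names are hypothetical, not the paper's implementation:

```python
import torch
from torch.distributions import Normal

def gaussian_log_prob(net, states, actions):
    """Log-likelihood under a heteroscedastic Gaussian policy.

    `net` is assumed to map states to a (mean, log_std) pair per action
    dimension; this interface is illustrative.
    """
    mean, log_std = net(states)
    dist = Normal(mean, log_std.exp())
    return dist.log_prob(actions).sum(dim=-1)  # factorised over action dims

def coherent_reward(policy_net, prior_net, states, actions, alpha=1.0):
    """Shaped reward defined by the cloned policy: the scaled log policy ratio."""
    with torch.no_grad():
        return alpha * (gaussian_log_prob(policy_net, states, actions)
                        - gaussian_log_prob(prior_net, states, actions))
```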