Inferring Smooth Control:

Monte Carlo Posterior Policy Iteration with Gaussian Processes

Joe Watson & Jan Peters

Intelligent Autonomous Systems group, TU Darmstadt

joe@robot-learning.de

@JoeMWatson

Motivation

[Video: iCEM model predictive control (Pinneri et al. 2020)]

[Video: PPI model predictive control (this work)]

Stochastic Search for Control


How do we preserve action correlations for smoothness?
Sample from the joint distribution! Fit the correlations!

$\va_t \sim p(\cdot)$
❗️$\mA \sim p(\cdot) = p(\va_1, \va_2, \dots)$❗️
$p(\va_1, \va_2, \dots) = \prod_t p(\va_t)$ 🚨
$q(\mA) = \sum_n w^{(n)} \delta(\mA - \mA^{(n)})$
❗️$ \min_{\color{#FF217D}{p}} \KL[q \mid\mid \color{#FF217D}{p}]$❗️

[Figure: sampled action sequences over time with their weights $w^{(n)}$]

Posterior Policy Iteration


Optimizing for the episodic return $R(\mA) = \E_{\vs_{t+1}\sim p(\cdot\mid\vs_t,\va_t),\,\vs_1\sim p(\cdot)}\left[\textstyle\sum_t r(\vs_t,\va_t)\right]$, what should the weights $w^{(n)} = f(R^{(n)})$ be?


The objective $$ \textstyle\max_q \E_{\mA\sim q(\cdot)}\left[\alpha\,R(\mA)\right] - \KL[q(\mA)\mid\mid p(\mA)], $$

has a 'pseudo' (or Gibbs) posterior solution (e.g. Zellner (1988), Catoni (2003)),

$$ q_\color{#FF217D}{\alpha}(\mA) \propto \exp(\color{#FF217D}{\alpha} R(\mA))\,\color{blue}{p(\mA)}\;\rightarrow\; w^{(n)}{\propto}\exp(\color{#FF217D}{\alpha} R^{(n)}),\,\mA^{(n)}{\sim}p(\cdot). $$

Posterior policy iteration (PPI): pseudo-posteriors for optimal control (Rawlik et al. (2013)).
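In code, the Monte Carlo step of PPI reduces to exponentially weighting sampled returns. A minimal numpy sketch (not the authors' implementation; variable names are illustrative):

```python
import numpy as np

def pseudo_posterior_weights(returns, alpha):
    """Self-normalized weights w^(n) ∝ exp(alpha * R^(n)) for sampled action sequences.

    `returns` holds the Monte Carlo returns R^(n); `alpha` is the inverse temperature.
    Subtracting the max keeps the exponentials numerically stable without changing the weights.
    """
    logits = alpha * np.asarray(returns, dtype=float)
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

# e.g. weight four sampled action sequences by their returns
weights = pseudo_posterior_weights([1.0, 2.5, 0.3, 2.4], alpha=2.0)
```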

Research questions:

$ \color{blue}{\text{1) How to design the prior for control?}} $

$ \color{#FF217D}{\text{2) How to set the temperature for Monte Carlo optimization?}} $

Action Priors for Control


For robotics, i.i.d. action noise is undesirable on physical systems!

  • Jerky movements can damage the physical system and its environment
  • Desirable solutions move smoothly, as humans do

Action Priors for Control

Action smoothing is often treated as an implementation detail rather than part of the theory (Williams et al. 2018, Nagabandi et al. 2019, Pinneri et al. 2020, ...).

E.g. first-order smoothing: $ \beta\,\va_t^{(n)} + (1-\beta)\,\va_{t-1}^{(n)} $
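For concreteness, a minimal sketch of this filter applied to i.i.d. Gaussian samples, assuming the previous term is the already-smoothed action (one common reading of such implementations):

```python
import numpy as np

def first_order_smooth(noise, beta):
    """a_t = beta * eps_t + (1 - beta) * a_{t-1}, applied along the planning horizon.

    `noise` has shape (H, d_a): i.i.d. Gaussian action noise.
    Smaller beta gives smoother but more lagged action sequences.
    """
    actions = np.zeros_like(noise)
    actions[0] = noise[0]
    for t in range(1, len(noise)):
        actions[t] = beta * noise[t] + (1.0 - beta) * actions[t - 1]
    return actions

smoothed = first_order_smooth(np.random.randn(30, 2), beta=0.7)
```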

Desiderata

  • Ensure the solution is sufficiently smooth...
  • ...without introducing lag.
  • Smooth exploratory samples for Monte Carlo efficiency.

Action Priors for Control

Nonparametric Gaussian processes (GP) (Rasmussen & Williams, 2006) capture a diverse range of smooth stochastic processes.

$$p(a_1, a_2, \dots) = \gN(\mathbf{0}, \mSigma) = \gG\gP(0, \gC(\vt)),\, \vt = [t_1, t_2, \dots]$$

[Figure: smooth GP action samples $a$ over time $t$]
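A minimal sketch of drawing smooth scalar exploration samples from such a GP prior, assuming a squared-exponential covariance for concreteness:

```python
import numpy as np

def se_kernel(t, lengthscale=0.3, variance=1.0):
    """Squared-exponential covariance matrix over the time grid t."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Sample smooth scalar action sequences a ~ N(0, K(t))
t = np.linspace(0.0, 1.0, 50)
K = se_kernel(t) + 1e-6 * np.eye(t.size)   # jitter for a stable Cholesky
L = np.linalg.cholesky(K)
samples = L @ np.random.randn(t.size, 5)   # five smooth samples, one per column
```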

Vector-valued Gaussian Processes


How can we scale nonparametric GPs to high-dimensional action sequences? $$ \mA \in \mathbb{R}^{H \times d_a} $$

A Matrix Normal Distribution
A random matrix $\mX \in \sR^{n\times p}$ follows a matrix normal distribution $\mX{\,\sim\,}\gM\gN(\mM, \mK, \mSigma)$ with density $$ p(\mX) = \gM\gN(\mX;\mM, \mK, \mSigma) = \frac{1}{(2\pi)^{np/2}|\mK|^{p/2}|\mSigma|^{n/2}}\exp\left(-\frac{1}{2}\text{tr}\{\mSigma^{-1}(\mX-\mM)\tran\mK\inv(\mX-\mM)\}\right), $$ for $\mM\in\sR^{n\times p}$, $\mK\in \sR^{n\times n}$, $\mK \succcurlyeq 0$ and $\mSigma\in \sR^{p\times p}$, $\mSigma \succcurlyeq 0$, if and only if $\text{vec}(\mX){\,\sim\,}\gN(\text{vec}(\mM), \mSigma\otimes\mK)$, where $\otimes$ denotes the Kronecker product.

$$ \text{Kronecker product: }\; \begin{bmatrix} \Sigma_{11} & \dots &\Sigma_{1m} \\ \vdots & \ddots & \vdots \\ \Sigma_{n1} & \dots &\Sigma_{nm} \end{bmatrix} \otimes\mK = \begin{bmatrix} \Sigma_{11}\,\mK & \dots &\Sigma_{1m}\,\mK \\ \vdots & \ddots & \vdots \\ \Sigma_{n1}\,\mK & \dots &\Sigma_{nm}\,\mK \end{bmatrix} \quad \text{(covariance of flattened matrix)} $$

Define our action prior,
$$ \mA = [\va_t,\va_{t+1},\dots,\va_{t+H}] \sim \gM\gN(\vzero, \gK(\vt), \mSigma), \quad \vt = [t, t+\Delta, \dots, t+H\Delta], $$ for covariance function $\gK(t,t')$, timestep $\Delta$ and horizon $H$.
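Sampling from this prior does not require forming the Kronecker product. A sketch using the standard matrix normal factorization $\mA = \mL_{\mK}\,\mZ\,\mL_{\mSigma}\tran$ with i.i.d. standard normal $\mZ$ (names are illustrative):

```python
import numpy as np

def sample_action_prior(K, Sigma, n_samples=1, jitter=1e-6):
    """Draw A ~ MN(0, K, Sigma) via A = L_K Z L_Sigma^T with Z_ij ~ N(0, 1).

    K: (H, H) temporal covariance, e.g. a kernel matrix over the planning times.
    Sigma: (d_a, d_a) covariance coupling the action dimensions.
    Returns an array of shape (n_samples, H, d_a).
    """
    H, d_a = K.shape[0], Sigma.shape[0]
    L_k = np.linalg.cholesky(K + jitter * np.eye(H))
    L_s = np.linalg.cholesky(Sigma + jitter * np.eye(d_a))
    Z = np.random.randn(n_samples, H, d_a)
    return L_k @ Z @ L_s.T
```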

Continuous-time Gaussian Processes


How do we update our solution each time step for model predictive control?


Gaussian processes provide an elegant means to 'time shift' in MPC.


Plan for timesteps $t$ to $t+H$, based on previous solution up to time $\tau$, $$ q_\alpha(\va_{t:t+H}\mid \gO_{1:\tau}) = \textstyle\int q_\alpha(\va_{1:t+H}\mid\gO_{1:\tau})\,\mathrm{d}\va_{1:t} \propto \textstyle\int \underbrace{p(\gO_{1:\tau}\mid\va_{1:\tau})}_{\text{previous solution}}\, \underbrace{p(\va_{1:t+H})}_{\text{time-extended prior}} \, \mathrm{d}\va_{1:t}. $$

The previous solution corresponds to an unknown effective likelihood $\color{blue}{\gN(\vmu_{\gO|t_1:t_2}, \mSigma_{\gO|t_1:t_2})}$,
$ \vmu_{t_1:t_2|\gO} = \vmu_{t_1:t_2} + \mK_{t_1:t_2}(\color{blue}{\vmu_{\gO|t_1:t_2}}-\vmu_{t_1:t_2}) ,\quad \mSigma_{t_1:t_2, t_1:t_2|\gO} = \mSigma_{t_1:t_2,t_1:t_2}- \mK_{t_1:t_2}\color{blue}{\mSigma_{\gO|t_1:t_2}}\mK_{t_1:t_2}\tran, $
where $\mK_{t_1:t_2}{\,=\,}\mSigma_{t_1:t_2, t_1:t_2} \color{blue}{\mSigma_{\gO|t_1:t_2}}\inv$.


Compute unknown terms using old posterior and prior,
$\color{magenta}{\vnu_{t_1:t_2}} = \color{blue}{\mSigma_{\gO|t_1:t_2}}\inv(\color{blue}{\vmu_{\gO|t_1:t_2}}-\vmu_{t_1:t_2}) = \mSigma_{t_1:t_2, t_1:t_2}\inv(\vmu_{t_1:t_2|\gO} - \vmu_{t_1:t_2}) $,
$\color{magenta}{\mathbf{\Lambda}_{t_1:t_2}} = \color{blue}{\mSigma_{\gO|t_1:t_2,t_1:t_2}}\inv = \mSigma_{t_1:t_2,t_1:t_2}\inv(\mSigma_{t_1:t_2,t_1:t_2} - \mSigma_{t_1:t_2,t_1:t_2|\gO})\mSigma_{t_1:t_2,t_1:t_2}\inv $.


Combine these likelihood terms with the time-shifted prior $\color{turquoise}{\gN(\vmu_{t_3:t_4}, \mSigma_{t_3:t_4})}$ and cross-covariance $\color{SeaGreen}{\mSigma_{t_3:t_4,t_1:t_2}} = \gK(\vt_{t_3:t_4},\vt_{t_1:t_2})$:
$\vmu_{t_3:t_4|\gO} = \color{turquoise}{\vmu_{t_3:t_4}} + \color{SeaGreen}{\mSigma_{t_3:t_4,t_1:t_2}} \color{magenta}{\vnu_{t_1:t_2}},$
$\mSigma_{t_3:t_4,t_3:t_4|\gO} = \color{turquoise}{\mSigma_{t_3:t_4,t_3:t_4}} - \color{SeaGreen}{\mSigma_{t_3:t_4,t_1:t_2}} \color{magenta}{\mathbf{\Lambda}_{t_1:t_2}} \color{SeaGreen}{\mSigma_{t_3:t_4,t_1:t_2}}\tran.$
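A sketch of this time-shift in numpy, under assumed variable names: recover the effective-likelihood terms $(\vnu, \mathbf{\Lambda})$ from the previous posterior and prior over $t_1{:}t_2$, then condition the time-extended prior over $t_3{:}t_4$ through the cross-covariance:

```python
import numpy as np

def time_shift_plan(mu_post_12, Sigma_post_12, mu_prior_12, Sigma_prior_12,
                    mu_prior_34, Sigma_prior_34, Sigma_34_12):
    """Condition the time-extended Gaussian prior on the previous MPC solution.

    Blocks `12` cover the old planning times t1:t2, blocks `34` the new times t3:t4;
    Sigma_34_12 is the kernel cross-covariance between them.
    """
    P = np.linalg.inv(Sigma_prior_12)                 # a linear solve is preferable in practice
    nu = P @ (mu_post_12 - mu_prior_12)               # effective-likelihood mean term
    Lam = P @ (Sigma_prior_12 - Sigma_post_12) @ P    # effective-likelihood precision
    mu_34 = mu_prior_34 + Sigma_34_12 @ nu
    Sigma_34 = Sigma_prior_34 - Sigma_34_12 @ Lam @ Sigma_34_12.T
    return mu_34, Sigma_34
```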

Experimental Results

[Video: PPI model predictive control (this work)]

Experimental Results

[Figures: experimental results]

Dirty Laundry 🧺🤢



  • Gaussian processes introduce implementation complexity and some numerical sensitivity.
  • Gaussian processes introduce hyperparameters (e.g. lengthscales) tuned offline.
  • Smoothness (i.e. lengthscales) cannot always be learned from expert demonstrations.

Inferring Smooth Control:
Monte Carlo Posterior Policy Iteration with Gaussian Processes

Joe Watson, Jan Peters

Intelligent Autonomous Systems group, TU Darmstadt

joe@robot-learning.de   @JoeMWatson   joemwatson.github.io

Poster: OGG (260), Level 0 Room 038
Paper: arxiv.org/abs/2210.03512
Code: github.com/JoeMWatson/monte-carlo-posterior-policy-iteration
Slides: joemwatson.github.io/slides/corl22oral
Website: monte-carlo-ppi.github.io

Posterior Constraints for Monte Carlo Optimization


$$ q_\color{#FF217D}{\alpha}(\mA){\,\propto\,}\exp(\color{#FF217D}{\alpha} R(\mA))\,p(\mA). $$ How do we choose $\color{#FF217D}{\alpha}$ to balance greediness and inference quality?

Posterior Constraints for Monte Carlo Optimization


Three ways to derive the pseudo-posterior $ q_\alpha(\mA){\,\propto\,}\exp(\alpha R(\mA))\,p(\mA)$:


1) Gibbs likelihood, $\max_q \E_q[R] - \frac{1}{\alpha}\KL[q\mid\mid p]$.

2) Relative entropy policy search (REPS) (Peters et al. 2010), $\max_q \E_q[R] \; \text{s.t.} \; \KL[q\mid\mid p] \leq \epsilon$.

3) Minimum KL problem (this work), $\min_q \KL[q \mid \mid p] \; \text{s.t.} \; \E_q[R] = R^*$.


$\E_q[R]$ is approximated using importance sampling!


Lower-bound Policy Search (LBPS). Maximize $R^*$ using a probabilistic lower-bound of $\E_q[R]$,
$$ \max_\alpha \hat{R}^*_\alpha = \max_\alpha \E_{q_\alpha \, / p}[R(\mA)] - ||R||_\infty\sqrt{\frac{1-\color{#FF217D}{\delta}}{\color{#FF217D}{\delta}}\frac{1}{\color{blue}{\hat{N}_\alpha}}}, \quad \text{where $\color{blue}{\hat{N}_\alpha}$ is the effective sample size}. $$ We balance optimization against inference accuracy through the bound probability $\color{#FF217D}{\delta} \in [0, 1]$.
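A sketch of the resulting temperature selection, assuming a simple grid of candidate $\alpha$ values (the grid search and names are an illustrative simplification, not the authors' optimizer):

```python
import numpy as np

def lbps_temperature(returns, alphas, delta=0.1):
    """Select alpha by maximizing the probabilistic lower bound on E_q[R].

    Uses the self-normalized importance-sampling estimate of E_q[R] (samples drawn
    from the prior), penalized by ||R||_inf * sqrt((1 - delta) / (delta * N_eff)),
    where N_eff is the effective sample size of the weights.
    """
    R = np.asarray(returns, dtype=float)
    R_inf = np.abs(R).max()
    bounds = []
    for alpha in alphas:
        logits = alpha * R
        logits -= logits.max()
        w = np.exp(logits)
        w /= w.sum()
        n_eff = 1.0 / np.sum(w ** 2)    # effective sample size of the weights
        bounds.append(w @ R - R_inf * np.sqrt((1.0 - delta) / (delta * n_eff)))
    return alphas[int(np.argmax(bounds))]

# e.g. pick a temperature for a batch of sampled returns
alpha_star = lbps_temperature(np.random.rand(64), alphas=np.logspace(-2, 2, 50))
```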