Inference, Models and Priors
for Control

Joe Watson

Thesis Defence

Intelligent Autonomous Systems, Technical University of Darmstadt
Systems AI for Robot Learning, German Research Center for AI (DFKI)

'Able Mabel',
Tomorrow's World
(BBC Archive, 1966).
Teleoperating the PR1,
Wyrobek et al. (2008).
$\pi_0$, a large behaviour model
Physical Intelligence (2024).

Learn from demonstrations,

$\max_{\pi} \E_{\vs,\va\sim\gD_{\text{expert}}}[\log\pi(\va\mid\vs)].$

Sample actions from policy,

$\va_h \sim \pi(\cdot\mid\vs_h).$

Interact with environment,

$r_h = r(\vs_h,\va_h),\;\vs_{h+1} \sim p(\cdot\mid\vs_h,\va_h),\;\vtau = [\vs_1, \va_1, \dots].$

Maximize return,

$$ \pi_\star = \mathop{\arg\max}_\pi\; \E_{\vtau\sim\rho_\pi(\cdot)}\left[\sum_{h=1}^H r(\vs_h,\va_h)\right]. $$
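The interaction loop above can be sketched for a toy one-dimensional system; the linear policy, additive-noise dynamics, and quadratic reward here are illustrative stand-ins, not part of the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(s):
    # Illustrative Gaussian policy: mean is a linear function of state.
    return rng.normal(-0.5 * s, 0.1)

def step(s, a):
    # Toy 1-D dynamics with additive noise; reward penalizes distance to origin.
    s_next = s + a + 0.01 * rng.normal()
    r = -s_next**2
    return s_next, r

H = 50
s, ret = 1.0, 0.0
trajectory = []
for h in range(H):
    a = policy(s)          # sample action from policy
    s, r = step(s, a)      # interact with environment
    ret += r               # accumulate return
    trajectory.append((s, a, r))

print(f"return over H={H} steps: {ret:.3f}")
```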

Open problems:

  • Robot-friendly exploration and control (Part 1)
  • Improving an imitation policy with on-policy data (Part 2)
  • Sample-efficient learning using prior knowledge (Part 3)

The promise of robot learning

Imitation

Learn behaviours using demonstrations $\gD_{\text{expert}}$ and policy $\pi$.

Exploration

Collect data from the environment to learn efficiently.

Improvement

Learn to solve tasks using reward $r(\vs,\va)$, 

$$ \pi_\star = \mathop{\arg\max}_\pi\; \E_{\vtau\sim\rho_\pi(\cdot)}\!\!\left[\sum_{h=1}^H r(\vs_h,\va_h)\right]\!\!. $$


Consider any decision-making problem as stochastic optimization, $$ q_{\star}(\vx) = \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ f(\vx) ]. $$

Convexify this objective with entropy regularization, $$ q_{\coloralpha}(\vx) = \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ f(\vx) ] - \coloralpha\,\KL[q(\vx)\mid\mid p(\vx)]. $$

This problem has a closed-form solution, a `pseudo-posterior', $$ q_{\coloralpha}(\vx) = \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ f(\vx) ] - \coloralpha\,\KL[q(\vx)\mid\mid p(\vx)] \propto \exp\left(\frac{1}{\coloralpha} f(\vx)\right)\,p(\vx). $$

Bayesian inference,
$ \begin{aligned} q_N(\vtheta) \propto \exp\left(\sum_{n=1}^N \log p(\vx_n\mid\vtheta)\right) p(\vtheta) \quad \text{(Bayes' posterior)}. \end{aligned} $

Online mirror descent (Vovk (1990)),
$ \begin{aligned} q_{t}(\vx) \propto \exp\left( \eta_t \sum_{h=1}^t f_h(\vx) \right) p(\vx) \quad \text{(Exponential weights)}. \end{aligned} $

Reinforcement learning (Ziebart et al. (2008), Peters et al. (2010)),
$ \begin{aligned} \pi_\coloralpha(\va\mid\vs) \propto \exp\left(\frac{1}{\coloralpha} Q(\vs,\va)\right) p(\va\mid\vs) \quad \text{(Boltzmann policy)}. \end{aligned} $

Variational inequality (Donsker & Varadhan (1976)),
$ \begin{aligned} \E_{\vx\sim q_\coloralpha(\vx)} [f(\vx)] - \E_{\vx\sim p(\vx)} [f(\vx)] \geq \coloralpha\, \KL[q_\coloralpha(\vx)\mid\mid p(\vx)] \geq 0. \end{aligned} $

Robot learning algorithms can be enhanced by
using more sophisticated statistical approaches.


Inference

$\color{#58A4B0}{\max_{q}}\;\color{#58A4B0}{\E_{\vx\sim q(\cdot)}[} f(\vx) \color{#58A4B0}{]} - \color{#58A4B0}{\alpha}\,\color{#58A4B0}{\KL[}q(\vx) \;\color{#58A4B0}{\mid\mid}\; p(\vx) \color{#58A4B0}{]}$

Models

$\max_{q}\;\E_{\vx\sim \color{#FF217D}{q(}\cdot \color{#FF217D}{)} }[f(\vx)]-\alpha\,\KL[ \color{#FF217D}{q(} \vx \color{#FF217D}{)} \mid\mid \color{#FF217D}{p(}\vx \color{#FF217D}{)} ]$

Priors

$\max_{q}\;\E_{\vx\sim q(\cdot)}[f(\vx)]-\alpha\,\KL[q(\vx)\mid\mid \color{blue}{p(\vx)}]$

(Word cloud: robot learning spans imitation, model-free and model-based optimal control, and skills such as manipulation and locomotion.)

Distributions over actions, $p(\va_1,\dots,\va_H)$

Inferring smooth control: Monte Carlo posterior policy iteration with Gaussian processes, CoRL (2022)

Distributions over behaviours, $p(\va_h\mid\vs_h)$

Coherent soft imitation learning, NeurIPS (2023)

Distributions over predictions, $p(\vs_{h+1}\mid\vs_h,\va_h)$

Tractable Bayesian dynamics models from differentiable physics for learning and control, R:SS WS (2022), ICRA@40 (2024), In preparation (2024)

1. Introduction

2. Distributions over actions

3. Distributions over behaviours

4. Distributions over predictions

5. Summary

Distributions over actions

The Swan, Anna Pavlova & Mikhail Fokine (1907).

Distributions over actions

The coordination of arm movements: An experimentally confirmed mathematical model,
Flash et al. (The Journal of Neuroscience, 1985).

iCEM model predictive control
(Pinneri et al. 2020)


Gaussian process action prior
(ours)


Monte Carlo optimization with pseudo-posteriors.

Algorithm 1: Monte Carlo posterior iteration.

  1. Sample from the prior: $f^{(k)} = f(\vx^{(k)}),\;\vx^{(k)} \sim p(\cdot)$,
  2. E-step: compute importance weights, $w_\coloralpha^{(k)} \propto \exp\left(\frac{1}{\coloralpha}f(\vx^{(k)})\right)$,
  3. M-step: update the prior moments, $\mathop{\min}_p \KL[q(\vx)\mid\mid p(\vx)],\; q(\vx) = \sum_{k=1}^K w_\coloralpha^{(k)} \delta(\vx-\vx^{(k)}).$
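Algorithm 1 can be sketched for a scalar objective with a Gaussian prior; the quadratic objective, sample budget, and temperature below are illustrative choices, not the thesis configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy objective with maximum at x = 2.
    return -(x - 2.0) ** 2

mu, sigma, alpha, K = 0.0, 2.0, 0.5, 512
for _ in range(20):
    # 1. Sample from the prior p(x) = N(mu, sigma^2).
    x = rng.normal(mu, sigma, size=K)
    # 2. E-step: self-normalized exponential weights w ∝ exp(f(x)/alpha).
    logw = f(x) / alpha
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # 3. M-step: moment projection min_p KL[q || p] for a Gaussian p.
    mu = np.sum(w * x)
    sigma = np.sqrt(np.sum(w * (x - mu) ** 2) + 1e-8)

print(f"posterior mean ≈ {mu:.2f}")
```

The M-step is moment matching because minimizing $\KL[q\mid\mid p]$ over a Gaussian family reduces to matching the weighted mean and variance.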

For shooting-based control (policy search, MPC, etc.), we optimize the action sequence $\mA$ for return $U(\mA)$.

How do we preserve action correlations for smoothness?

Sample from the joint distribution! Fit the correlations!

Rather than sampling each step independently, $\va_h \sim p(\cdot)$ with $p(\va_1, \va_2, \dots) = \prod_{h=1}^H p(\va_h)$, sample the joint sequence $\mA \sim p(\cdot) = p(\va_1, \va_2, \dots)$, form the pseudo-posterior $q(\mA) = \sum_k w_\coloralpha^{(k)} \delta(\mA - \mA^{(k)})$, and fit the correlated prior via $\min_{p} \KL[q(\mA) \mid\mid p(\mA)]$.


How do we model action correlations for smoothness?


Nonparametric Gaussian processes (GP) (Rasmussen & Williams, 2006) capture a diverse range of smoothed stochastic processes, $$p(a_1, a_2, \dots) = \gN(\mathbf{0}, \mSigma) = \gG\gP(0, C(\vh)),\, \vh = [h_1, h_2, \dots].$$

We adopt a matrix normal to structure the $\text{vec}(\mA)\in\sR^{d_aH}$ covariance, $$\underbrace{\mSigma}_{d_aH\times d_aH} = \underbrace{C(\vh)}_{H\times H} \otimes \underbrace{\sigma^2\mI_{d_a}}_{d_a\times d_a}.$$
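A sketch of the Kronecker-structured prior: build $\mSigma = C(\vh)\otimes\sigma^2\mI$ from a squared-exponential kernel (an illustrative choice of $C$) and sample a smooth joint action sequence; the horizon, dimension, and hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

H, d_a, ell, sig = 25, 3, 5.0, 0.3  # horizon, action dim, lengthscale, scale
h = np.arange(H, dtype=float)

# Squared-exponential kernel over timesteps gives smooth sequences.
C = np.exp(-0.5 * (h[:, None] - h[None, :]) ** 2 / ell**2)

# Kronecker factorization: Sigma = C(h) ⊗ sig^2 I, shape (d_a H, d_a H).
Sigma = np.kron(C, sig**2 * np.eye(d_a))

# Sample a joint action sequence A ~ N(0, Sigma) and reshape to (H, d_a).
L = np.linalg.cholesky(Sigma + 1e-6 * np.eye(d_a * H))
A = (L @ rng.normal(size=d_a * H)).reshape(H, d_a)

# Smoothness: successive actions are strongly correlated.
diffs = np.abs(np.diff(A, axis=0)).max()
print(Sigma.shape, diffs)
```

The Kronecker structure means the $d_aH\times d_aH$ covariance is specified by only an $H\times H$ kernel matrix and a scalar action scale.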

For MPC, we can shift the solution in continuous time for replanning.

How to choose the posterior temperature $\coloralpha$ for $\colorK$ samples?

$\begin{aligned} w_\coloralpha^{(k)} = \frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})} \propto \exp\left(\frac{1}{\coloralpha} U(\mA^{(k)})\right). \end{aligned}$

$\begin{aligned} \tilde{\colorK}_\coloralpha = \frac{(\sum_{k=1}^\colorK w_\coloralpha^{(k)})^2}{\sum_{k=1}^\colorK {w_\coloralpha^{(k)}}^2} \in [1,\colorK], \text{ the effective sample size (ESS).} \end{aligned}$


Lower-bound policy search (LBPS).

Instead of maximizing $\frac{1}{\colorK}\sum_{k=1}^\colorK w_\coloralpha^{(k)} U(\mA^{(k)})$, can we maximize a lower bound on the true expectation that holds with high probability $1-\colordelta$?

$\begin{aligned} \mathop{\max}\limits_{\coloralpha\geq 0}\; \underbrace{\overbrace{\frac{1}{\colorK}\sum_{k=1}^\colorK w_\coloralpha^{(k)} U(\mA^{(k)})}^{\text{Maximize posterior return}} - \lambda_\colordelta\;\overbrace{\sqrt{\frac{1}{\tilde{\colorK}_\coloralpha}}}^{\text{regularize ESS}}}_{\text{lower-bound}}, \quad \lambda_\colordelta = ||U||_\infty\sqrt{\frac{1-\colordelta}{\colordelta}}, \quad \colordelta \in [0,1]. \end{aligned}$

Metelli et al., Policy optimization via importance sampling, JMLR (2020)
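The LBPS temperature selection can be sketched with self-normalized importance sampling on toy returns; the Gaussian returns, the grid search over $\coloralpha$, and the use of the empirical maximum as a stand-in for $||U||_\infty$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy returns of K sampled action sequences A^(k) ~ p(·).
K, delta = 256, 0.1
U = rng.normal(0.0, 1.0, size=K)          # U(A^(k)); bounded in practice
U_inf = np.abs(U).max()                   # stand-in for ||U||_inf
lam = U_inf * np.sqrt((1 - delta) / delta)

def lower_bound(alpha):
    # Self-normalized importance weights w ∝ exp(U / alpha), scaled to sum
    # to K so that (1/K) Σ w U is the SNIS estimate of the posterior return.
    logw = U / alpha
    w = np.exp(logw - logw.max())
    w *= K / w.sum()
    ess = w.sum() ** 2 / np.sum(w**2)     # effective sample size K̃_α ∈ [1, K]
    return np.mean(w * U) - lam / np.sqrt(ess), ess

# Maximize the lower bound over a grid of temperatures.
alphas = np.logspace(-2, 2, 100)
bounds = [lower_bound(a)[0] for a in alphas]
alpha_star = alphas[int(np.argmax(bounds))]
ess_star = lower_bound(alpha_star)[1]
print(alpha_star, ess_star)
```

Small $\coloralpha$ concentrates the weights (high posterior return, low ESS); large $\coloralpha$ flattens them (low return, high ESS); the bound trades the two off automatically.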

Black-box optimization

Policy search

Model predictive control

$$\min_{\vx\in\gX} f(\vx)$$

Model predictive control

HumanoidStandup-v2
(MPPI).
HumanoidStandup-v2
(LBPS).

Model predictive control

door-v0
(MPPI).
hammer-v0
(MPPI).
door-v0
(LBPS).
hammer-v0
(LBPS).

Inference

Optimize the temperature w.r.t. an importance-sampled lower bound on the expected return.

$$\tilde{\colorK}_\coloralpha$$

Models

Smooth kernels and Kronecker-factorized covariances to scale to high-dimensional action spaces.

$$\mSigma = C(\vh) \otimes \sigma^2\mI$$

Priors

Continuous-time stationary Gaussian process prior over open-loop action sequences.

Inferring smooth control: Monte Carlo posterior policy iteration with Gaussian processes,
Watson, J., Peters, J.
Conference on Robot Learning (2022) [oral]

github.com/JoeMWatson/monte-carlo-posterior-policy-iteration

1. Introduction

2. Distributions over actions

3. Distributions over behaviours

4. Distributions over predictions

5. Summary

Distributions over behaviours

Movement of the hand, Eadweard Muybridge (1887).

Distributions over behaviours

Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,
Rajeswaran et al. (R:SS, 2018).

Behavioural cloning (BC) $$ \max_{\pi\in\Pi}\;\E_{\vs,\va\sim \gD_\text{expert}}[\log \pi(\va\mid\vs)]. $$

  • Simple supervised learning
  • Suffers from compounding errors and distribution shift over long time horizons

Inverse reinforcement learning (IRL) $$ \max_{r\in\gR}\left\{\E\left[\sum_{h=0}^\infty \gamma^hr(\vs_h,\va_h)\mid \pi_{\text{expert}}\right] - \min_{\pi\in\Pi}\E\left[\sum_{h=0}^\infty \gamma^hr(\vs_h,\va_h)\mid\pi\right]\right\}. $$

  • Overcomes compounding errors by learning from experience
  • Tackles an ambiguous and expensive optimization problem
(Figure: performance over iterations for RL and IRL initialized from scratch ($\pi_0$) or from BC ($\pi_{\text{BC}}$), in the cases $\pi_{\text{BC}} \neq \pi_\star$ and $\pi_{\text{BC}} = \pi_\star$ w.r.t. $r_0$ and $\gD_0$.)
Orsini et al., What matters for adversarial imitation learning?, NeurIPS (2021)

The entropy-regularized (soft) optimal policy is, $$ \pi(\va\mid\vs) = \exp\left(\frac{1}{\coloralpha}(Q(\vs,\va)-V_\coloralpha(\vs))\right)\,p(\va\mid\vs). $$

Rearranging,
$$Q(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} + V_\coloralpha(\vs).$$

Substituting $Q$ into the soft Bellman equation,
$$r(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} - \gamma\,\E_{\vs'\sim p(\cdot\mid\vs,\va)}[V_\coloralpha(\vs')] + V_\coloralpha(\vs).$$

Compare with a shaped reward $\tilde{r}$ (Ng et al., 1999), which has the same optimal policy as $r$ for any potential $\psi:\gS\rightarrow\sR$,
$$r(\vs,\va) = \tilde{r}(\vs,\va) - \gamma\,\E_{\vs'\sim p(\cdot\mid\vs,\va)}[\psi(\vs')] + \psi(\vs),$$
so here
$$\tilde{r}(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)}, \quad \psi(\vs) = V_\coloralpha(\vs).$$

The log policy ratio is a coherent reward for which the BC policy is optimal!

Algorithm 2: Coherent soft imitation learning (CSIL).

  1. Do KL-regularized BC on demonstration data,
  2. Construct the coherent reward,
  3. Do entropy-regularized RL with the coherent reward.
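The first two steps of CSIL can be sketched in a state-free toy setting; the Gaussian "BC policy", the unit temperature, and the broad prior are illustrative assumptions, and the RL step (3) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.0

# Expert demonstrations: actions near a* = 1 (state-free toy problem).
a_expert = rng.normal(1.0, 0.1, size=100)

# Step 1: BC — here, fit a Gaussian policy to the demonstrations
# (a small std floor stands in for KL regularization towards the prior).
mu_bc, std_bc = a_expert.mean(), a_expert.std() + 0.05

def log_gauss(a, mu, std):
    return -0.5 * ((a - mu) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def coherent_reward(a, prior_mu=0.0, prior_std=2.0):
    # Step 2: r̃(a) = α (log π_BC(a) − log p(a)), the log policy ratio.
    return alpha * (log_gauss(a, mu_bc, std_bc) - log_gauss(a, prior_mu, prior_std))

# Expert-like actions are rewarded; far-away actions are penalized.
r_expert = coherent_reward(1.0)
r_far = coherent_reward(-3.0)
print(r_expert, r_far)
```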

Can coherency work with deep learning?

$\begin{aligned} \pi(\va\mid\vs) &= \gN(\vmu_\vphi(\vs), \mSigma_\vphi(\vs))&&\text{(MLP)},\\ \pi(\va\mid\vs) &= \gN(\mW\varphi_\vphi(\vs), \varphi_\vphi(\vs)^\top\mSigma\varphi_\vphi(\vs))&&\text{(Stationary MLP)},\\ \varphi_\vphi(\vs) &= \color{blue}{\vf_\text{periodic}(}\mW\,\vf_\text{MLP}(\vs)\color{blue}{)}.&& \end{aligned} $

Meronen et al., Periodic activation functions induce stationarity, NeurIPS (2021)
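A minimal sketch of why periodic activations induce stationarity, here with fixed random Fourier features rather than a learned MLP (an assumption; the stationary-MLP parametrization above learns its features): the induced kernel $\varphi(s)^\top\varphi(s')$ depends only on $s - s'$.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 256
W = rng.normal(size=(D, 1))  # random frequencies (RFF for an RBF-like kernel)

def phi(s):
    # Periodic activations: cos/sin features make the induced kernel
    # phi(s)ᵀphi(s') = (1/D) Σ cos(w_i (s - s')), i.e. shift-invariant.
    z = W @ np.atleast_2d(s)
    return np.concatenate([np.cos(z), np.sin(z)], axis=0) / np.sqrt(D)

def k(s1, s2):
    return (phi(s1).T @ phi(s2)).item()

# Stationarity check: the kernel is invariant to shifting both inputs.
shift_invariance = abs(k(0.3, 0.8) - k(1.3, 1.8))
print(k(0.0, 0.0), shift_invariance)
```

Stationarity makes the policy's predictive variance revert to the prior away from the data, which is what the coherent reward exploits.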

Tabular

Online and
offline MuJoCo

Online
Adroit

Online
RoboMimic

Pixel-based, Offline RoboMimic

Online
RoboMimic

NutAssemblySquare (IQLearn),
$n=200$.
NutAssemblySquare (CSIL),
$n=200$.

Inference

A state-of-the-art imitation learning algorithm that avoids adversarial objectives and fine-tunes BC policies.

Models

Behavioural cloning with stationary parametric policies.

Priors

Inverting KL-regularized reinforcement learning.

$$\coloralpha \log \frac{\pi(\va\mid\vs)}{p(\va\mid\vs)}$$

Coherent soft imitation learning,
Watson, J., Huang, S. H., Heess, N.
Advances in Neural Information Processing Systems (2023) [spotlight]

github.com/google-deepmind/csil/

1. Introduction

2. Distributions over actions

3. Distributions over behaviours

4. Distributions over predictions

5. Summary

Distributions over predictions

Classical mechanics, Walter Lewin (MIT, 1999).

A (differentiable) simulator can be a Gaussian process!

DNN2GP (Khan et al., 2019) approximates a Bayesian neural network posterior with a GP using the Laplace approximation.

SIM2GP (ours) approximates the posterior of a differentiable physics simulator with a GP in the same fashion.

$$q(\vs_{h+1}\mid\vs_h,\va_h) = \gN(\vf_{\vtheta_\text{MAP}}(\vs_h,\va_h),\,\color{blue}{\mJ_h}^\top\mSigma_\vw\color{blue}{\mJ_h}+\sigma^2\mI), \quad \color{blue}{\mJ_h} = \nabla_\vtheta \vf_\vtheta(\vs_h,\va_h)\Big|_{\vtheta=\vtheta_\text{MAP}}. $$

Khan et al., Approximate inference turns deep networks into Gaussian processes, NeurIPS (2019)
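The SIM2GP predictive can be illustrated on a hand-written linear "simulator", where the Jacobian is available in closed form; the parameter values are illustrative, and a real implementation would differentiate through the physics engine:

```python
import numpy as np

# Toy "differentiable simulator": f_theta(s, a) = theta[0]*s + theta[1]*a.
theta_map = np.array([0.9, 0.1])      # MAP physics parameters (e.g. mass, gain)
Sigma_w = np.diag([0.05, 0.02])       # Laplace posterior covariance over theta
sigma2 = 1e-3                         # aleatoric noise variance

def f(theta, s, a):
    return theta[0] * s + theta[1] * a

def jacobian(s, a):
    # ∇_theta f at theta_MAP; for this linear model it is simply [s, a].
    return np.array([s, a])

def predictive(s, a):
    # Linearized-Laplace (GP) predictive: mean from the MAP simulator,
    # variance from parameter uncertainty pushed through the Jacobian.
    J = jacobian(s, a)
    mean = f(theta_map, s, a)
    var = J @ Sigma_w @ J + sigma2
    return mean, var

m0, v0 = predictive(1.0, 0.5)
m1, v1 = predictive(5.0, 2.0)
print(m0, v0, m1, v1)
```

Note how the epistemic variance grows with the input magnitude through the Jacobian, while the MAP mean is exactly the simulator's prediction.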

Combining physics and Bayesian function approximation

Neural linear models.

$ \quad \vf \sim q_\vphi(\cdot): \; \vf(\vx) = \vw^\top \varphi_\vtheta(\vx), \; \vw \sim \gN(\vmu, \mSigma), \; \vphi = \{\vmu, \mSigma, \vtheta\}. $

Residual models (Saveriano et al., 2017).

$ \vf_{{}_{\text{RES}}} = \vf_{{}_{\text{SIM2GP}}} + \vf_{{}_{\text{NLM}}}, \quad \vf_{{}_{\text{SIM2GP}}} \sim \gG\gP_{{}_{\text{SIM2GP}}}(\cdot), \quad \vf_{{}_{\text{NLM}}} \sim q_\vphi(\cdot). $

Function-space variational inference (Sun et al., 2019).

$ \textstyle\max_{\vphi}\; \E_{\vf\sim q_\vphi(\cdot)}[\log p(\gD\mid\vf)] - \sD[q_\vphi(\vf)\mid\mid p(\vf)]. $


$$\sD[q_\vphi(\vf)\mid\mid p(\vf)] \approx \E_{\color{blue}{\vx\sim m(\cdot)}}[\sD[q(\vf_\color{blue}{\vx})\mid\mid p(\vf_\color{blue}{\vx})]].$$

Watson et al., Neural linear models with Gaussian process priors, AABI (2021)
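The measurement-point approximation above can be sketched by averaging closed-form Gaussian KLs at sampled inputs; the marginal mean/variance functions and the uniform measurement distribution are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu_q, var_q, mu_p, var_p):
    # KL[N(mu_q, var_q) || N(mu_p, var_p)] for scalar marginals.
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Marginal means/variances of the variational posterior q and the prior p.
def q_marginal(x):
    return np.sin(x), 0.1 + 0.05 * x**2

def p_marginal(x):
    return 0.0 * x, 1.0 + 0.0 * x

# Function-space KL ≈ expectation over a measurement distribution m(x).
x_meas = rng.uniform(-2.0, 2.0, size=1024)
kl_est = np.mean(kl_gauss(*q_marginal(x_meas), *p_marginal(x_meas)))
print(kl_est)
```

The key point is that the intractable KL between stochastic processes is replaced by an average of tractable KLs between their finite-dimensional marginals.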

Inductive biases and statistical decision-making

How should we design dynamics models for sequential decision-making?

For MBRL we use posterior sampling RL (PSRL), which has a Bayesian regret bound,

$$\E[\text{Regret}(H, T, \mathsf{Alg}, \gM)] = \tilde{\gO}(H^{3/2}(d_s + d_a)\sqrt{\color{blue}{\tilde{d}}T}).$$

The effective dimension ${\color{blue}{\tilde{d}}}$ is a general measure of model complexity for points $\mX$ and covariance function $C$,

$\begin{aligned} \color{blue}{\tilde{d}} &= \text{Tr}\{(C(\mX,\mX)+\sigma^2\mI)^{-1}C(\mX,\mX)\},\\ C(\mX,\mX) &= \varphi_\vtheta(\mX)^\top\mSigma\varphi_\vtheta(\mX). \end{aligned} $

  • For SIM2GP, ${\color{blue}{\tilde{d}}}$ is upper-bounded by the number of unknown physics parameters.
  • For residual models, ${\color{blue}{\tilde{d}}}$ is upper-bounded by the sum of the physics parameter and NLM feature counts.
  • For FSVI, $\color{blue}{\tilde{d}}$ is regularized to be close to that of the prior, i.e. the number of unknown physics parameters.
Watson et al., Posterior sampling reinforcement learning for continuous control with function approximation, In preparation (2024)
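The effective dimension can be computed directly from the feature matrices; this sketch contrasts a two-parameter "physics" feature space with a wider generic one (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Features of N data points for two models: a small "physics" feature space
# versus a wide generic feature space, each with unit weight covariance.
N, sigma2 = 200, 0.1
X = rng.normal(size=(N, 1))

def effective_dimension(Phi):
    # C(X, X) = Phi Sigma Phiᵀ with Sigma = I; d̃ = Tr{(C + σ²I)⁻¹ C}.
    C = Phi @ Phi.T
    return np.trace(np.linalg.solve(C + sigma2 * np.eye(N), C))

Phi_physics = np.concatenate([X, X**2], axis=1)          # 2 "physics" parameters
Phi_network = rng.normal(size=(N, 64)) / np.sqrt(64)     # 64 generic features

d_physics = effective_dimension(Phi_physics)
d_network = effective_dimension(Phi_network)
print(d_physics, d_network)
```

Since $\tilde{d} = \sum_i \lambda_i/(\lambda_i+\sigma^2)$ over the eigenvalues of $C$, it is bounded by the rank of the feature space, which is why the physics model enjoys a much smaller regret bound.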

Tabular and linear quadratic PSRL

Deep PSRL


Posterior sampling active learning (PSAL)

Deep PSAL


Physics-informed
PSRL

Physics-informed PSAL

Watson et al., Posterior sampling reinforcement learning for continuous control with function approximation, In preparation (2024)
Watson et al., Posterior sampling active learning for dynamical systems, In preparation (2024)
Watson et al., Tractable Bayesian dynamics models from differentiable physics for learning and control, In preparation (2024)


Inference

Combine black-box function approximation and inductive biases using function-space VI.

Models

SIM2GP: the linearized Laplace approximation applied to differentiable physics models.

Priors

Approximate Gaussian process physics models for posterior-sampling sequential decision-making.

Tractable Bayesian dynamics models from differentiable physics for learning and control
Watson, J., Hahner, B., Peters, J.
R:SS differentiable physics workshop (2022), ICRA@40 (2024), In preparation (2024)

To be released.

1. Introduction

2. Distributions over actions

3. Distributions over behaviours

4. Distributions over predictions

5. Summary

Summary

Robot learning algorithms can be enhanced
by using more sophisticated statistical approaches.


Inference

Models

Priors

Distributions over actions

Lower-bound posterior iteration

CoRL (2022)

$\tilde{\colorK}_\coloralpha$

Distributions over behaviours

Coherent soft imitation learning

NeurIPS (2023)

$\coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)}$

Distributions over predictions

SIM2GP

In preparation (2024)

$\sD[q_\vphi(\vf)\mid\mid p(\vf)]$


Inference, models and priors for control




Appendix

Limitations

Distributions over actions

Lower-bound posterior iteration

CoRL (2022)

  • The GP prior has a lengthscale hyperparameter that must be tuned.
  • Smoothness is not always beneficial for control, e.g. pendulum swing-up.
  • Poor scaling when $K$ is large, due to fitting a dense covariance matrix.

Distributions over behaviours

Coherent soft imitation learning

NeurIPS (2023)

  • Requires a task that BC can solve given enough data.
  • Minimum-time tasks need a negative coherent-reward tweak.
  • The stationary policy does not work well on CPU.

Distributions over predictions

SIM2GP

In preparation (2024)

  • FSVI requires long pre-training on the state-action domain.
  • Resource issues scaling to complex differentiable simulators.
  • Have yet to demonstrate performance on real hardware.

Further work

Distributions over actions

Lower-bound posterior iteration

CoRL (2022)

  • Use sparse GP approximations for better scaling.
  • Look at LBPS between two Gaussians without SNIS.
  • GPU and C++ implementation for hardware demo.

Distributions over behaviours

Coherent soft imitation learning

NeurIPS (2023)

  • Unshape coherent reward with soft value function estimate.
  • Use for non-adversarial motion priors.
  • Investigate scaling to SOTA BC policy architectures.

Distributions over predictions

SIM2GP

In preparation (2024)

  • Finish demos for contact-rich simulation and real-world MBRL.
  • PSRL regret bound for unknown feature spaces?
  • Extend to latent-variable model MBRL.


How Gaussian processes provide an elegant means to 'time shift' in MPC.


Plan for timesteps $h$ to $h+H$, based on previous solution up to time $\tau$, $$ q_\alpha(\va_{h:h+H}\mid \gO_{1:\tau}) = \textstyle\int q_\alpha(\va_{1:h+H}\mid\gO_{1:\tau})\,\mathrm{d}\va_{1:h} \propto \textstyle\int \underbrace{p(\gO_{1:\tau}\mid\va_{1:\tau})}_{\text{previous solution}}\, \underbrace{p(\va_{1:h+H})}_{\text{time-extended prior}} \, \mathrm{d}\va_{1:h}. $$

The previous solution is due to an unknown likelihood $\color{blue}{\gN(\vmu_{\gO|h_1:h_2}, \mSigma_{\gO|h_1:h_2})}$,
$ \vmu_{h_1:h_2|\gO} = \vmu_{h_1:h_2} + \mK_{h_1:h_2}(\color{blue}{\vmu_{\gO|h_1:h_2}}-\vmu_{h_1:h_2}) ,\quad \mSigma_{h_1:h_2, h_1:h_2|\gO} = \mSigma_{h_1:h_2,h_1:h_2}- \mK_{h_1:h_2}\color{blue}{\mSigma_{\gO|h_1:h_2}}\mK_{h_1:h_2}\tran, $
where $\mK_{h_1:h_2}{\,=\,}\mSigma_{h_1:h_2, h_1:h_2} \color{blue}{\mSigma_{\gO|h_1:h_2}}\inv$.


Compute unknown terms using old posterior and prior,
$\color{magenta}{\vnu_{h_1:h_2}} = \color{blue}{\mSigma_{\gO|h_1:h_2}}\inv(\color{blue}{\vmu_{\gO|h_1:h_2}}-\vmu_{h_1:h_2}) = \mSigma_{h_1:h_2, h_1:h_2}\inv(\vmu_{h_1:h_2|\gO} - \vmu_{h_1:h_2}) $,
$\color{magenta}{\mathbf{\Lambda}_{h_1:h_2}} = \color{blue}{\mSigma_{\gO|h_1:h_2,h_1:h_2}}\inv = \mSigma_{h_1:h_2,h_1:h_2}\inv(\mSigma_{h_1:h_2,h_1:h_2} - \mSigma_{h_1:h_2,h_1:h_2|\gO})\mSigma_{h_1:h_2,h_1:h_2}\inv $.


Combine these likelihood terms with the time-shifted prior $\color{turquoise}{\gN(\vmu_{h_3:h_4}, \mSigma_{h_3:h_4})}$ and cross-correlation $\color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}} = k(\vh_{h_3:h_4},\vh_{h_1:h_2})$
$\vmu_{h_3:h_4|\gO} = \color{turquoise}{\vmu_{h_3:h_4}} + \color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}} \color{magenta}{\vnu_{h_1:h_2}},$
$\mSigma_{h_3:h_4,h_3:h_4|\gO} = \color{turquoise}{\mSigma_{h_3:h_4,h_3:h_4}} - \color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}} \color{magenta}{\mathbf{\Lambda}_{h_1:h_2}} \color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}}\tran.$

How to choose the posterior temperature $\coloralpha$ for $\colorK$ samples?

$\begin{aligned} w_\coloralpha^{(k)} = \frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})} \propto \exp\left(\frac{1}{\coloralpha} U(\mA^{(k)})\right). \end{aligned}$

$\begin{aligned} \coloralpha_\star = \mathop{\arg\min}\limits_{\coloralpha\geq 0}\; \KL[q_\coloralpha(\mA)\mid\mid p(\mA)] \quad \text{s.t.} \quad \E_{\mA\sim q_\coloralpha(\cdot)}[U(\mA)] = U_\star. \end{aligned}$

$\begin{aligned} \coloralpha_\star = \mathop{\arg\min}\limits_{\coloralpha\geq 0}\; \KL[q_\coloralpha(\mA)\mid\mid p(\mA)] \quad \text{s.t.} \quad \frac{1}{\colorK}\sum_{k=1}^\colorK\frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})}U(\mA^{(k)}) = U_{\color{#81a274}{?}}, \; \mA^{(k)}\sim p(\cdot).\end{aligned}$

$\begin{aligned} \tilde{\colorK}_\coloralpha = \frac{(\sum_{k=1}^\colorK w_\coloralpha^{(k)})^2}{\sum_{k=1}^\colorK {w_\coloralpha^{(k)}}^2} \in [1,\colorK], \text{ the effective sample size.}\end{aligned}$

$\begin{aligned}\tilde{\coloralpha}_\star \approx \mathop{\arg\max}\limits_{\coloralpha\geq 0}\;\tilde{U}_\star(\colorK,\coloralpha) \quad \text{ s.t.} \quad \mathrm{P} \left(\E_{\mA \sim q_\coloralpha(\cdot)}[U(\mA)] \geq \tilde{U}_\star(\colorK,\coloralpha)\vphantom{\int}\right) > 1-\colordelta,\;\colordelta \in [0, 1],\end{aligned}$

$\begin{aligned}\phantom{\tilde{\coloralpha}_\star} = \mathop{\arg\max}\limits_{\coloralpha\geq 0}\; \underbrace{\frac{1}{\colorK}\sum_{k=1}^\colorK\frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})}U(\mA^{(k)}) - ||U||_\infty\sqrt{\frac{1-\colordelta}{\colordelta}\frac{1}{\tilde{\colorK}_\coloralpha}}}_{\tilde{U}_\star(\colorK,\coloralpha), \text{ lower-bound}}, \; \tilde{\colorK}_\coloralpha \approx \frac{\colorK}{d_2[q_\coloralpha(\mA)\mid\mid p(\mA)]}. \end{aligned}$

Metelli et al., Policy optimization via importance sampling, JMLR (2020)

Why does the coherent reward work?

$ \begin{aligned} \tilde{r}(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} \begin{cases} > 0 \text{ if } \vs\in\gD_{\text{expert}},\va\in\gD_{\text{expert}},\\ \leq 0 \text{ if } \vs\in\gD_{\text{expert}},\va\not\in\gD_{\text{expert}},\\ \approx 0 \text{ if } \vs\not\in\gD_{\text{expert}}\forall\va\in\gA. \end{cases} \end{aligned} $

Online
Adroit

hammer-v0 (IQLearn), $n=30$.
hammer-v0 (CSIL),
$n=30$.
Watson et al., Sample-efficient online imitation learning using pretrained behavioural cloning policies,
NeurIPS Robot Learning workshop (2022)

Balancing exploration and caution under uncertainty.

Posterior averaging reinforcement learning (PARL).
Optimize policy over posterior predictive dynamics, $$\pi_{\vphi,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}Q_\star^{\vf_\vphi, H}(\vs,\va), \; \vf_\vphi = q_\phi(\vs_{h+1}\mid\vs,\va). $$

  • Hugely influential (e.g., PILCO, PETS, MBPO).
  • No principled exploration mechanism.
  • May exhibit the "turn-off phenomenon" if prior is misspecified.

Posterior sampling reinforcement learning (PSRL).
Compute optimal policy for a posterior sample, $$\pi_{k,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}Q_\star^{\vf^{(k)}\!, H}(\vs,\va), \; \vf^{(k)} \sim q_\vphi(\cdot). $$

  • Principled exploration with bounded Bayesian regret.
  • No hyperparameter for controlling exploration.

$\colorK-$posterior sampling reinforcement learning.
Compute optimal policy over $\colorK$ posterior samples, $$\pi_{\colorK,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}{\frac{1}{\colorK}}\!\sum_{k=1}^KQ_\star^{{\vf^{(k)}}\!, H}(\vs,\va), \;\vf^{(k)} \sim q_\vphi(\cdot).$$

  • Generalizes posterior sampling with $\colorK=1$.
  • $\colorK\!\rightarrow\!\infty$ is the function-space posterior average.
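The three action-selection rules differ only in how many posterior samples they average; a toy sketch with a Gaussian belief over the Q-values of three discrete actions (an illustrative stand-in for $Q_\star^{\vf,H}$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior belief over the optimal Q-values of 3 discrete actions:
# a stand-in for the posterior over dynamics models f and their Q*.
mu = np.array([0.0, 0.5, 0.4])
std = np.array([1.0, 0.1, 0.5])

def k_psrl_action(K):
    # Sample K "models" from the posterior and act greedily w.r.t. the average.
    Q_samples = rng.normal(mu, std, size=(K, 3))
    return int(np.argmax(Q_samples.mean(axis=0)))

a_psrl = k_psrl_action(1)       # K = 1: posterior sampling (exploratory)
a_parl = k_psrl_action(10_000)  # K → ∞: posterior averaging (greedy in the mean)
print(a_psrl, a_parl)
```

With $K=1$ the uncertain first action is sometimes selected (exploration); as $K$ grows the rule collapses to the posterior-mean maximizer, which never explores.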

Reasoning about the effective dimension.


Residual models. We can think of the addition of the predictions from two GPs as a single GP with a concatenated feature space and factorized weight covariance $$ C(\vx) = \begin{bmatrix} \varphi_1(\vx)\\ \varphi_2(\vx)\\ \end{bmatrix}^\top \begin{bmatrix} \mW_1 & \vzero\\ \vzero & \mW_2\\ \end{bmatrix} \begin{bmatrix} \varphi_1(\vx)\\ \varphi_2(\vx)\\ \end{bmatrix}, $$ and therefore the SIM2GP and NLM feature dimensions are combined to bound $\tilde{d}$.


Function-space variational inference. Another way of writing the effective dimension is $\tilde{d} = \sigma^{-2}\sum_{n=1}^N\sigma^2(\vx_n)$ over the $N$ points in dataset $\mX$ (Janz et al. 2020). As FSVI seeks to match the predictive variances inside the dataset, if $|\sigma_q(\vx) - \sigma_p(\vx)|<\epsilon\;\forall\;\vx\in\mX$, and the prior and posterior process share the same aleatoric variance $\sigma^2$, then $$ |\tilde{d}_q - \tilde{d}_p| \leq N\sigma^{-2}\epsilon. $$

Relevance to foundation models / generative AI.

Distributions over actions

Lower-bound posterior iteration

CoRL (2022)

  • Action chunking is similar to MPC.
  • Markovian GPs could be used for continuous-time chunking.
  • LBPS could be used to train generative models that use IS.

Distributions over behaviours

Coherent soft imitation learning

NeurIPS (2023)

  • The coherent reward is used in DPO, a (concurrent) SOTA RLHF algorithm.
  • Imitation learning is relevant for training LLMs beyond next-token prediction (NTP).