Joe Watson
Thesis Defence
Intelligent Autonomous Systems, Technical University of Darmstadt
Systems AI for Robot Learning, German Research Center for AI (DFKI)
Learn from demonstrations,
$\max_{\pi} \E_{\vs,\va\sim\gD_{\text{expert}}}[\log\pi(\va\mid\vs)].$
Sample actions from policy,
$\va_h \sim \pi(\cdot\mid\vs_h).$
Interact with environment,
$r_h = r(\vs_h,\va_h),\;\vs_{h+1} \sim p(\cdot\mid\vs_h,\va_h),\;\vtau = [\vs_1, \va_1, \dots].$
Maximize return,
$\max_{\pi} \E_{\vtau}\big[{\textstyle\sum_{h=1}^H} r_h\big].$
The promise of robot learning
Imitation
Learn behaviours using demonstrations $\gD_{\text{expert}}$ and policy $\pi$.
Exploration
Collect data from the environment to learn efficiently.
Improvement
Learn to solve tasks using reward $r(\vs,\va)$.
Some open problems:
Consider any decision-making problem as stochastic optimization, $$q_{\star}(\vx) = \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ \color{#81a274}{f(\vx)} ].$$
Convexify this objective with entropy regularization, $$q_{\coloralpha}(\vx) = \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ \color{#81a274}{f(\vx)} ] - \coloralpha\,\KL[q(\vx)\mid\mid \color{#58A4B0}{p(\vx)}].$$
This problem has a closed-form solution, a `pseudo-posterior', \begin{aligned} \color{DarkOrchid}{q_{\coloralpha}(\vx)} &= \mathop{\arg\max}_{q}\;\E_{\vx\sim q(\cdot)}[ \color{#81a274}{f(\vx)} ]- \coloralpha\,\KL[q(\vx)\mid\mid \color{#58A4B0}{p(\vx)} ]\\ &\propto \exp\left(\frac{1}{\coloralpha} \color{#81a274}{f(\vx)} \right)\, \color{#58A4B0}{p(\vx)}. \end{aligned}
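The closed form is easy to verify numerically. A minimal sketch on a discrete support (the objective values, prior, and support below are illustrative assumptions, not from the thesis): as $\coloralpha\rightarrow 0$ the pseudo-posterior concentrates on the maximizer of $f$, and as $\coloralpha\rightarrow\infty$ it recovers the prior.

```python
import numpy as np

# Pseudo-posterior q_alpha(x) ∝ exp(f(x)/alpha) p(x) on a discrete support.
f = np.array([0.0, 1.0, 3.0])   # objective values f(x) (illustrative)
p = np.array([0.5, 0.3, 0.2])   # prior p(x) (illustrative)

def pseudo_posterior(f, p, alpha):
    w = np.exp(f / alpha) * p
    return w / w.sum()          # normalize to a distribution

q_small = pseudo_posterior(f, p, alpha=0.01)  # low temperature: near-greedy
q_large = pseudo_posterior(f, p, alpha=1e6)   # high temperature: near-prior
```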
Bayesian inference,
$
\begin{aligned}
\color{DarkOrchid}{q_\color{blue}{N}(\vtheta)}
\propto
\exp\left(\frac{\color{blue}{N}}{\color{#81a274}{N}}\color{#81a274}{\sum_{n=1}^N \log
p(\vx_n\mid\vtheta)}\right)
\color{#58A4B0}{p(\vtheta)}
\quad
\text{(Bayes' posterior)}.
\end{aligned}
$
Online mirror descent (Vovk (1990)),
$
\begin{aligned}
\color{DarkOrchid}{q_{\color{blue}{t}}(\vx)}
&\propto \exp\left(
\vphantom{\sum_{n=1}^N}
\color{blue}{\eta_t}
\color{#81a274}{\sum_{h=1}^t f_h(\vx)}
\right)
\color{#58A4B0}{p(\vx)}
\quad
\text{(Exponential weights)}.
\end{aligned}$
Reinforcement learning (Ziebart et al. (2008), Peters et al. (2010)),
$
\begin{aligned}
\color{DarkOrchid}{\pi_\coloralpha(\va\mid\vs)}
\propto
\exp\left(\vphantom{\sum_{n=1}^N}\frac{1}{\coloralpha}
\color{#81a274}{Q(\vs,\va)}\right)
\color{#58A4B0}{p(\va\mid\vs)}
\quad
\text{(Boltzmann policy)}.
\end{aligned}
$
Variational inequality (Donsker & Varadhan (1976)),
$
\begin{aligned}
\vphantom{\sum_{n=1}^N}
\E_{\vx\sim\color{DarkOrchid}{q_\coloralpha(\vx)}}
[\color{#81a274}{f(\vx)}]
-
\E_{\vx\sim\color{#58A4B0}{p(\vx)}}
[\color{#81a274}{f(\vx)}]
\geq \coloralpha\,
\KL[\color{DarkOrchid}{q_\coloralpha(\vx)}\mid\mid \color{#58A4B0}{p(\vx)}] \geq 0.
\end{aligned}
$
Robot learning algorithms can be enhanced by using more sophisticated statistical approaches.
Inference
$\color{#58A4B0}{\max_{q}}\;\color{#58A4B0}{\E_{\vx\sim q(\cdot)}[} f(\vx) \color{#58A4B0}{]} - \color{#58A4B0}{\alpha}\,\color{#58A4B0}{\KL[}q(\vx) \;\color{#58A4B0}{\mid\mid}\; p(\vx) \color{#58A4B0}{]}$
Models
$\max_{q}\;\E_{\vx\sim \color{#FF217D}{q(}\cdot \color{#FF217D}{)} }[f(\vx)]-\alpha\,\KL[ \color{#FF217D}{q(} \vx \color{#FF217D}{)} \mid\mid \color{#FF217D}{p(}\vx \color{#FF217D}{)} ]$
Priors
$\max_{q}\;\E_{\vx\sim q(\cdot)}[f(\vx)]-\alpha\,\KL[q(\vx)\mid\mid \color{blue}{p(\vx)}]$
skills
manipulation
locomotion
optimal control
model-free
model-based
imitation
Robot learning algorithms can be
enhanced by
using more sophisticated statistical approaches.
Distributions over actions
Inferring smooth control: Monte Carlo posterior policy iteration with Gaussian processes, CoRL (2022)
Distributions over behaviours
Coherent soft imitation learning,
NeurIPS (2023)
Distributions over predictions
Tractable Bayesian dynamics models from differentiable physics for learning and
control,
R:SS WS (2022), ICRA@40 (2024), In preparation (2024)
$p(\va_1,\dots,\va_H)$
$p(\va_h\mid\vs_h)$
$p(\vs_{h+1}\mid\vs_h,\va_h)$
1. Introduction
2. Distributions over actions
3. Distributions over behaviours
4. Distributions over predictions
5. Summary
Distributions over actions
iCEM model predictive control
(Pinneri et al. 2020)
Gaussian process action prior
(ours)
Monte Carlo optimization with pseudo-posteriors.
Algorithm 1: Monte Carlo posterior iteration.
For shooting methods (policy search, MPC, etc.), we optimize the action sequence $\mA$ for the return $U(\mA)$.
How do we preserve action correlations for smoothness?
Sample from the joint distribution!
Fit the correlations!
How do we model action correlations for smoothness?
Nonparametric Gaussian processes (GP) (Rasmussen & Williams, 2006) capture a diverse range of smoothed stochastic processes, $$p(a_1, a_2, \dots) = \gN(\mathbf{0}, \mSigma) = \gG\gP(0, C(\vh)),\, \vh = [h_1, h_2, \dots].$$
We adopt a matrix normal to structure the $\text{vec}(\mA)\in\sR^{d_aH}$ covariance, $$\underbrace{\mSigma}_{d_aH\times d_aH} = \underbrace{C(\vh)}_{H\times H} \otimes \underbrace{\sigma^2\mI_{d_a}}_{d_a\times d_a}.$$
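The Kronecker structure above makes sampling smooth, correlated action sequences cheap. A minimal numpy sketch, assuming a squared-exponential kernel over time and illustrative choices of horizon, action dimension, lengthscale, and scale:

```python
import numpy as np

# Smooth action prior: Sigma = C(h) ⊗ sigma^2 I, with C a squared-exponential
# kernel over timesteps. H, d_a, ell, sigma are illustrative assumptions.
H, d_a, ell, sigma = 50, 3, 5.0, 0.3
h = np.arange(H, dtype=float)
C = np.exp(-0.5 * (h[:, None] - h[None, :])**2 / ell**2)       # H x H
Sigma = np.kron(C + 1e-6 * np.eye(H), sigma**2 * np.eye(d_a))  # (d_a H)^2

# Sample one smooth action sequence A from the joint Gaussian prior.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(Sigma)
A = (L @ rng.standard_normal(d_a * H)).reshape(H, d_a)
```

Note that only the $H\times H$ kernel matrix ever needs a dense factorization in practice, since the Kronecker factors can be handled separately.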
For MPC, we can shift the solution in continuous time for replanning,
How to choose the posterior temperature $\coloralpha$ for $\colorK$ samples?
$\begin{aligned} w_\coloralpha^{(k)} = \frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})} \propto \exp\left(\frac{1}{\coloralpha} f(\vx^{(k)})\right) \end{aligned}$
$\begin{aligned} \tilde{\colorK}_\coloralpha = \frac{(\sum_{k=1}^\colorK w_\coloralpha^{(k)})^2}{\sum_{k=1}^\colorK {w_\coloralpha^{(k)}}^2} \in [1,\colorK], \text{ the effective sample size (ESS).} \end{aligned}$
Lower-bound policy search (LBPS).
Instead of maximizing $\sum_{k=1}^\colorK w_\coloralpha^{(k)} U(\mA^{(k)})$, can we maximize a lower-bound of the true expectation that holds with high probability $1-\colordelta$?
$\begin{aligned} \mathop{\max}\limits_{\coloralpha\geq 0}\; \underbrace{\overbrace{\frac{1}{\colorK}\sum_{k=1}^\colorK w_\coloralpha^{(k)} U(\mA^{(k)})}^{\text{Maximize posterior return}} - \lambda_\colordelta\;\overbrace{\sqrt{\frac{1}{\tilde{\colorK}_\coloralpha}}}^{\text{regularize ESS}}}_{\text{lower-bound}}, \quad \lambda_\colordelta = ||U||_\infty\sqrt{\frac{1-\colordelta}{\colordelta}}, \quad \colordelta \in [0,1]. \end{aligned}$
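The temperature selection above reduces to a one-dimensional optimization, e.g. over a grid. A self-contained sketch with illustrative random returns and $\delta = 0.1$ (the grid and return values are assumptions for illustration):

```python
import numpy as np

# Lower-bound temperature selection: maximize the importance-sampled return
# minus the ESS-based concentration penalty. U and delta are illustrative.
rng = np.random.default_rng(1)
U = rng.uniform(0.0, 1.0, size=64)    # returns U(A^(k)) of K prior samples
delta = 0.1
lam = np.max(np.abs(U)) * np.sqrt((1 - delta) / delta)

def lower_bound(alpha):
    w = np.exp((U - U.max()) / alpha)  # unnormalized weights (stabilized)
    w = w / w.mean()                   # self-normalize: mean(w * U) ≈ E_q[U]
    ess = w.sum()**2 / (w**2).sum()    # effective sample size in [1, K]
    return np.mean(w * U) - lam / np.sqrt(ess)

alphas = np.logspace(-3, 1, 50)
alpha_star = alphas[np.argmax([lower_bound(a) for a in alphas])]
```

Small $\coloralpha$ drives the ESS towards 1 and the penalty dominates; large $\coloralpha$ keeps the ESS near $\colorK$ but leaves the estimate close to the prior's return, so the maximizer trades the two off.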
Black-box optimization
Policy search
Model predictive control
$$\min_{\vx\in\gX} f(\vx)$$
Model predictive control
Inference
Optimize the temperature w.r.t. an importance-sampled lower bound on the expectation.
Models
Smooth kernels and Kronecker-factorized covariances to scale to high-dimensional action spaces.
$$\mSigma = C(\vh) \otimes \sigma^2\mI$$
Priors
Continuous-time stationary Gaussian process prior over open-loop action sequences.
Inferring smooth control: Monte Carlo
posterior policy iteration
with Gaussian processes,
Watson, J., Peters, J.
Conference on Robot Learning (2022)
[oral]
github.com/JoeMWatson/monte-carlo-posterior-policy-iteration
1. Introduction
2. Distributions over actions
3. Distributions over behaviours
4. Distributions over predictions
5. Summary
Distributions over behaviours
Behavioural cloning (BC) $$ \max_{\pi\in\Pi}\;\E_{\vs,\va\sim \gD_\text{expert}}[\log \pi(\va\mid\vs)]. $$
Inverse reinforcement learning (IRL) $$ \max_{r\in\gR}\left\{\E\left[\sum_{h=0}^\infty \gamma^hr(\vs_h,\va_h)\mid \pi_{\text{expert}}\right] - \min_{\pi\in\Pi}\E\left[\sum_{h=0}^\infty \gamma^hr(\vs_h,\va_h)\mid\pi\right]\right\}. $$
The entropy-regularized (soft) optimal policy is, $$ \pi(\va\mid\vs) = \exp\left(\frac{1}{\coloralpha}(Q(\vs,\va)-V_\coloralpha(\vs))\right)\,p(\va\mid\vs). $$
Rearranging, $$ Q(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} + V_\coloralpha(\vs). $$
Replacing $Q$ into the soft Bellman equation, $$ \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} = r(\vs,\va) + \gamma\,\E[V_\coloralpha(\vs')] - V_\coloralpha(\vs). $$
Compare with a shaped reward $\tilde{r}$ (Ng et al. 1999), which has the same optimal policy as $r$ for any potential $\psi:\gS\rightarrow\sR$, $$ \tilde{r}(\vs,\va) = r(\vs,\va) + \gamma\,\E[\psi(\vs')] - \psi(\vs). $$
The log policy ratio is a coherent reward for which the BC policy is optimal!
 
Algorithm 2: Coherent soft imitation learning (CSIL).
Can coherency work with deep learning?
$\begin{aligned} \pi(\va\mid\vs) &= \gN(\vmu_\vphi(\vs), \mSigma_\vphi(\vs))&&\text{(MLP)},\\ \pi(\va\mid\vs) &= \gN(\mW\varphi_\vphi(\vs), \varphi_\vphi(\vs)^\top\mSigma\varphi_\vphi(\vs))&&\text{(Stationary MLP)},\\ \varphi_\vphi(\vs) &= \color{blue}{\vf_\text{periodic}(}\mW\,\vf_\text{MLP}(\vs)\color{blue}{)}.&& \end{aligned} $
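The coherent reward $\tilde{r} = \coloralpha\log\frac{\pi}{p}$ is direct to evaluate once a BC policy is fitted. A 1-D Gaussian sketch (policy means, scales, and $\coloralpha$ are illustrative assumptions): the reward is positive where the BC policy is more confident than the prior, and negative elsewhere.

```python
import numpy as np

# Coherent reward r~(s, a) = alpha * log(pi(a|s) / p(a|s)) for a 1-D Gaussian
# BC policy and prior at a fixed state; all parameters are illustrative.
def gauss_logpdf(a, mu, std):
    return -0.5 * ((a - mu) / std)**2 - np.log(std * np.sqrt(2.0 * np.pi))

alpha = 0.5
mu_pi, std_pi = 1.0, 0.2   # BC policy: concentrated near the expert action
mu_p, std_p = 0.0, 1.0     # prior p(a|s): broad

def coherent_reward(a):
    return alpha * (gauss_logpdf(a, mu_pi, std_pi) - gauss_logpdf(a, mu_p, std_p))

r_expert = coherent_reward(1.0)   # expert-like action
r_other = coherent_reward(-2.0)   # non-expert action
```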
Experiments: tabular; online and offline MuJoCo; online Adroit; online RoboMimic; pixel-based offline RoboMimic.
Inference
State-of-the-art imitation learning algorithm that avoids an adversarial objective and finetunes BC policies.
Models
Behaviour cloning with stationary parametric policies.
Priors
Inverting KL-regularized reinforcement learning.
$$\coloralpha \log \frac{\pi(\va\mid\vs)}{p(\va\mid\vs)}$$
Coherent soft imitation learning,
Watson, J., Huang, S. H., Heess, N.
Advances in Neural Information Processing Systems (2023) [spotlight]
github.com/google-deepmind/csil/
1. Introduction
2. Distributions over actions
3. Distributions over behaviours
4. Distributions over predictions
5. Summary
Distributions over predictions
A (differentiable) simulator can be a Gaussian process!
BNN2GP (Khan et al., 2019) approximates a Bayesian neural network
posterior with a GP using the Laplace approximation.
SIM2GP (ours) approximates the posterior of a differentiable physics simulator with a GP in the same fashion.
$$q(\vs_{h+1}\mid\vs_h,\va_h) = \gN(\vf_{\vtheta_\text{MAP}}(\vs_h,\va_h),\,\color{blue}{\mJ_h}^\top\mSigma_\vw\color{blue}{\mJ_h}+\sigma^2\mI), \quad \color{blue}{\mJ_h} = \nabla_\vtheta \vf_\vtheta(\vs_h,\va_h)\Big|_{\vtheta=\vtheta_\text{MAP}}. $$
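The predictive above only needs the simulator's Jacobian w.r.t. its parameters at $\vtheta_\text{MAP}$. A toy sketch with a damped linear "simulator" and a finite-difference Jacobian (the dynamics, parameter posterior, and noise level are illustrative assumptions, not the thesis's models):

```python
import numpy as np

# SIM2GP-style predictive: linearize f_theta at theta_MAP and push the
# parameter covariance through the Jacobian. Toy dynamics, illustrative values.
def f(theta, x):
    a, b = theta  # "physics" parameters of damped linear dynamics
    return np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * (a * x[0] + b * x[1])])

theta_map = np.array([9.8, 0.5])
Sigma_w = np.diag([0.1, 0.01])  # Laplace posterior over parameters
sigma2 = 1e-3                   # observation noise variance

def predictive(x, eps=1e-6):
    # central finite-difference Jacobian J_h = df/dtheta at theta_MAP
    J = np.stack([(f(theta_map + eps * e, x) - f(theta_map - eps * e, x)) / (2 * eps)
                  for e in np.eye(2)], axis=1)
    mean = f(theta_map, x)
    cov = J @ Sigma_w @ J.T + sigma2 * np.eye(2)
    return mean, cov
```

With a differentiable simulator the finite differences would be replaced by exact autodiff Jacobians.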
Combining physics and Bayesian function approximation
Neural linear models.
$ \quad f(\vx) = \vw^\top \varphi_\vtheta(\vx), \; \vw \sim \gN(\vmu, \mSigma), \; \vphi = \{\vmu, \mSigma, \vtheta\}. $
Residual models (Saveriano et al., 2017).
$ \vf_{{}_{\text{RES}}} = \vf_{{}_{\text{SIM2GP}}} + \vf_{{}_{\text{NLM}}}, \quad \vf_{{}_{\text{SIM2GP}}} \sim \gG\gP_{{}_{\text{SIM2GP}}}(\cdot), \quad \vf_{{}_{\text{NLM}}} \sim q_\vphi(\cdot). $
Function-space variational inference (Sun et al., 2019).
$ \textstyle\max_{\vphi}\; \E_{\vf\sim q_\vphi(\cdot)}[\log p(\gD\mid\vf)] - \sD[q_\vphi(\vf)\mid\mid p(\vf)]. $
$$\sD[q_\vphi(\vf)\mid\mid p(\vf)] \approx \E_{\color{blue}{\vx\sim m(\cdot)}}[\sD[q(\vf_\color{blue}{\vx})\mid\mid p(\vf_\color{blue}{\vx})]].$$
Inductive biases and statistical decision-making
How should we design dynamics models for sequential decision-making?
For MBRL we use posterior sampling RL (PSRL), which has a Bayesian regret bound,
$$\E[\text{Regret}(H, T, \mathsf{Alg}, \gM)] = \tilde{\gO}(H^{3/2}(d_s + d_a)\sqrt{\color{blue}{\tilde{d}}T}).$$
The effective dimension ${\color{blue}{\tilde{d}}}$ is a general measure of model complexity for points $\mX$ and covariance function $C$,
$\begin{aligned} \color{blue}{\tilde{d}} &= \text{Tr}\{(C(\mX,\mX)+\sigma^2\mI)^{-1}C(\mX,\mX)\},\\ C(\mX,\mX) &= \varphi_\vtheta(\mX)^\top\mSigma\varphi_\vtheta(\mX). \end{aligned} $
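The effective dimension is a one-line computation given the kernel matrix. A sketch with random illustrative features (feature map, weight covariance, and noise level are assumptions): for a rank-$d$ feature kernel, $\tilde{d}$ is bounded by the feature dimension regardless of the number of data points.

```python
import numpy as np

# Effective dimension d~ = Tr{(C(X,X) + sigma^2 I)^{-1} C(X,X)} for a
# feature-based kernel C = phi(X)^T Sigma phi(X). Illustrative features.
rng = np.random.default_rng(2)
N, d_feat, sigma2 = 100, 5, 0.1
Phi = rng.standard_normal((d_feat, N))  # phi_theta(X), d_feat x N
Sigma = np.eye(d_feat)                  # weight covariance
C = Phi.T @ Sigma @ Phi                 # N x N kernel matrix, rank <= d_feat

d_eff = np.trace(np.linalg.solve(C + sigma2 * np.eye(N), C))
```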
Experiments: tabular and linear quadratic PSRL; deep PSRL; posterior sampling active learning (PSAL); deep PSAL; physics-informed PSRL; physics-informed PSAL.
Inference
Combine black-box function approximation and inductive biases using function-space VI.
Models
SIM2GP: the linearized Laplace approximation on differentiable physics models.
Priors
Approximate Gaussian process physics models for posterior-sampling sequential decision-making.
Tractable Bayesian dynamics models from differentiable physics for learning and control
Watson, J., Hahner, B., Peters, J.
R:SS differentiable physics workshop (2022), ICRA@40 (2024), In preparation (2024)
To be released.
1. Introduction
2. Distributions over actions
3. Distributions over behaviours
4. Distributions over predictions
5. Summary
Summary
Robot learning algorithms can be enhanced
by using
more sophisticated statistical approaches.
Inference
Models
Priors
Distributions over actions
Lower-bound posterior iteration
CoRL (2022)
$\tilde{\colorK}_\coloralpha$
Distributions over behaviours
Coherent soft imitation learning
NeurIPS (2023)
$\coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)}$
Distributions over predictions
SIM2GP
In preparation (2024)
$\sD[q_\vphi(\vf)\mid\mid p(\vf)]$
Inference, models and priors for control
Appendix
Limitations
Distributions over actions
Lower-bound posterior iteration
CoRL (2022)
Distributions over behaviours
Coherent soft imitation learning
NeurIPS (2023)
Distributions over predictions
SIM2GP
In preparation (2024)
Further work
Distributions over actions
Lower-bound posterior iteration
CoRL (2022)
Distributions over behaviours
Coherent soft imitation learning
NeurIPS (2023)
Distributions over predictions
SIM2GP
In preparation (2024)
How Gaussian processes provide an elegant means to 'time shift' in MPC.
Plan for timesteps $h$ to $h+H$, based on previous solution up to time $\tau$,
$$
q_\alpha(\va_{h:h+H}\mid \gO_{1:\tau})
=
\textstyle\int q_\alpha(\va_{1:h+H}\mid\gO_{1:\tau})\,\mathrm{d}\va_{1:h}
\propto
\textstyle\int
\underbrace{p(\gO_{1:\tau}\mid\va_{1:\tau})}_{\text{previous solution}}\,
\underbrace{p(\va_{1:h+H})}_{\text{time-extended prior}}
\,
\mathrm{d}\va_{1:h}.
$$
The previous solution corresponds to an unknown likelihood
$\color{blue}{\gN(\vmu_{\gO|h_1:h_2}, \mSigma_{\gO|h_1:h_2})}$,
$
\vmu_{h_1:h_2|\gO} = \vmu_{h_1:h_2} + \mK_{h_1:h_2}(\color{blue}{\vmu_{\gO|h_1:h_2}}-\vmu_{h_1:h_2})
,\quad
\mSigma_{h_1:h_2, h_1:h_2|\gO} = \mSigma_{h_1:h_2,h_1:h_2}- \mK_{h_1:h_2}\color{blue}{\mSigma_{\gO|h_1:h_2}}\mK_{h_1:h_2}\tran,
$
where $\mK_{h_1:h_2}{\,=\,}\mSigma_{h_1:h_2, h_1:h_2}
\color{blue}{\mSigma_{\gO|h_1:h_2}}\inv$.
Compute unknown terms using old posterior and prior,
$\color{magenta}{\vnu_{h_1:h_2}} = \color{blue}{\mSigma_{\gO|h_1:h_2}}\inv(\color{blue}{\vmu_{\gO|h_1:h_2}}-\vmu_{h_1:h_2}) =
\mSigma_{h_1:h_2, h_1:h_2}\inv(\vmu_{h_1:h_2|\gO} - \vmu_{h_1:h_2})
$,
$\color{magenta}{\mathbf{\Lambda}_{h_1:h_2}} =
\color{blue}{\mSigma_{\gO|h_1:h_2}}\inv =
\mSigma_{h_1:h_2,h_1:h_2}\inv(\mSigma_{h_1:h_2,h_1:h_2} - \mSigma_{h_1:h_2,h_1:h_2|\gO})\mSigma_{h_1:h_2,h_1:h_2}\inv
$.
Combine these likelihood terms with the time-shifted prior $\color{turquoise}{\gN(\vmu_{h_3:h_4}, \mSigma_{h_3:h_4})}$ and cross-correlation
$\color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}} = k(\vh_{h_3:h_4},\vh_{h_1:h_2})$,
$\vmu_{h_3:h_4|\gO} = \color{turquoise}{\vmu_{h_3:h_4}} + \color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}}
\color{magenta}{\vnu_{h_1:h_2}},$
$\mSigma_{h_3:h_4,h_3:h_4|\gO} = \color{turquoise}{\mSigma_{h_3:h_4,h_3:h_4}} - \color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}}
\color{magenta}{\mathbf{\Lambda}_{h_1:h_2}}
\color{SeaGreen}{\mSigma_{h_3:h_4,h_1:h_2}}\tran.$
How to choose the posterior temperature $\coloralpha$ for $\colorK$ samples?
$\begin{aligned} w_\coloralpha^{(k)} = \frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})} \propto \exp\left(\frac{1}{\coloralpha} f(\vx^{(k)})\right). \end{aligned}$
$\begin{aligned} \coloralpha_\star = \mathop{\arg\min}\limits_{\coloralpha\geq 0}\; \KL[q_\coloralpha(\mA)\mid\mid p(\mA)] \quad \text{s.t.} \quad \E_{\mA\sim q_\coloralpha(\cdot)}[U(\mA)] = U_\star. \vphantom{\sum_{k=1}^\colorK} \quad\quad\quad\quad\quad\quad\quad\quad\quad\;\, \end{aligned}$
$\begin{aligned} \coloralpha_\star = \mathop{\arg\min}\limits_{\coloralpha\geq 0}\; \KL[q_\coloralpha(\mA)\mid\mid p(\mA)] \quad \text{s.t.} \quad \frac{1}{\colorK}\sum_{k=1}^\colorK\frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})}U(\mA^{(k)}) = U_{\color{#81a274}{?}}, \; \mA^{(k)}\sim p(\cdot).\end{aligned}$
$\begin{aligned} \tilde{\colorK}_\coloralpha = \frac{(\sum_{k=1}^\colorK w_\coloralpha^{(k)})^2}{\sum_{k=1}^\colorK {w_\coloralpha^{(k)}}^2} \in [1,\colorK], \text{ the effective sample size.}\end{aligned}$
$\begin{aligned}\tilde{\coloralpha}_\star \approx \mathop{\arg\max}\limits_{\coloralpha\geq 0}\;\tilde{U}_\star(\colorK,\coloralpha) \quad \text{ s.t.} \quad \mathrm{P} \left(\E_{\mA \sim q_\coloralpha(\cdot)}[U(\mA)] \geq \tilde{U}_\star(\colorK,\coloralpha)\vphantom{\int}\right) > 1-\colordelta,\;\colordelta \in [0, 1],\end{aligned}$
$\begin{aligned}\phantom{\tilde{\coloralpha}_\star} = \mathop{\arg\max}\limits_{\coloralpha\geq 0}\; \underbrace{\frac{1}{\colorK}\sum_{k=1}^\colorK\frac{q_\coloralpha(\mA^{(k)})}{p(\mA^{(k)})}U(\mA^{(k)}) - ||U||_\infty\sqrt{\frac{1-\colordelta}{\colordelta}\frac{1}{\tilde{\colorK}_\coloralpha}}}_{\tilde{U}_\star(\colorK,\coloralpha), \text{ lower-bound}}, \; \tilde{\colorK}_\coloralpha \approx \frac{\colorK}{d_2[q_\coloralpha(\mA)\mid\mid p(\mA)]}. \end{aligned}$
Why does the coherent reward work?
$ \begin{aligned} \tilde{r}(\vs,\va) = \coloralpha\log\frac{\pi(\va\mid\vs)}{p(\va\mid\vs)} \begin{cases} > 0 \text{ if } \vs\in\gD_{\text{expert}},\va\in\gD_{\text{expert}},\\ \leq 0 \text{ if } \vs\in\gD_{\text{expert}},\va\not\in\gD_{\text{expert}},\\ \approx 0 \text{ if } \vs\not\in\gD_{\text{expert}}\forall\va\in\gA. \end{cases} \end{aligned} $
Balancing exploration and caution under uncertainty.
Posterior averaging reinforcement learning (PARL).
Optimize the policy over the posterior predictive dynamics,
$$\pi_{\vphi,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}Q_\star^{\vf_\vphi, H}(\vs,\va), \; \vf_\vphi = q_\vphi(\vs_{h+1}\mid\vs,\va).$$
Posterior sampling reinforcement learning (PSRL).
Compute the optimal policy for a posterior sample,
$$\pi_{k,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}Q_\star^{\vf^{(k)}\!, H}(\vs,\va), \; \vf^{(k)} \sim q_\vphi(\cdot).$$
$\colorK$-posterior sampling reinforcement learning.
Compute the optimal policy over $\colorK$ posterior samples,
$$\pi_{\colorK,\star}(\vs) = \mathop{\arg\max}_{\va\in\gA}\frac{1}{\colorK}\sum_{k=1}^\colorK Q_\star^{\vf^{(k)}\!, H}(\vs,\va), \;\vf^{(k)} \sim q_\vphi(\cdot).$$
Reasoning about the effective dimension.
Residual models.
We can think of the addition of the predictions from two GPs as a single GP with a concatenated feature space and a factorized weight covariance,
$$
C(\vx) =
\begin{bmatrix}
\varphi_1(\vx)\\
\varphi_2(\vx)\\
\end{bmatrix}^\top
\begin{bmatrix}
\mW_1 & \vzero\\
\vzero & \mW_2\\
\end{bmatrix}
\begin{bmatrix}
\varphi_1(\vx)\\
\varphi_2(\vx)\\
\end{bmatrix},
$$
and therefore the SIM2GP and NLM feature dimensions are combined to bound $\tilde{d}$.
Function-space variational inference.
Another way of writing the effective dimension is $\tilde{d} = \sigma^{-2}\sum_{n=1}^N\sigma^2(\vx_n)$ over the $N$ points in dataset $\mX$ (Janz et al. 2020).
As FSVI seeks to match the predictive variances inside the dataset, if $|\sigma_q(\vx) - \sigma_p(\vx)|<\epsilon\;\forall\;\vx\in\mX$, and the prior and posterior process share the same aleatoric variance $\sigma^2$, then
$$
|\tilde{d}_q - \tilde{d}_p| \leq N\sigma^{-2}\epsilon.
$$
Relevance to foundation models / generative AI.