## Exact GPR Method

An instance of response y from a Gaussian process regression (GPR) model can be modeled as

Hence, making predictions for new data from a GPR model requires:

• Knowledge of the coefficient vector, $\beta$, of fixed basis functions

• Ability to evaluate the covariance function $k\left(x,{x}^{\prime }|\theta \right)$ for arbitrary $x$ and ${x}^{\prime }$, given the kernel parameters or hyperparameters, $\theta$.

• Knowledge of the noise variance ${\sigma }^{2}$ that appears in the density $P\left({y}_{i}|f\left({x}_{i}\right),{x}_{i}\right)$

That is, one needs first to estimate $\beta$, $\theta$, and ${\sigma }^{2}$ from the data $\left(X,y\right)$.

### Parameter Estimation

One approach for estimating the parameters $\beta$, $\theta$, and ${\sigma }^{2}$ of a GPR model is by maximizing the likelihood $P\left(y|X\right)$ as a function of $\beta$, $\theta$, and ${\sigma }^{2}$[1]. That is, if $\stackrel{^}{\beta }$, $\stackrel{^}{\theta }$, and ${\stackrel{^}{\sigma }}^{2}$ are the estimates of $\beta$, $\theta$, and ${\sigma }^{2}$, respectively, then:

Because

`$P\left(y|X\right)=P\left(y|X,\beta ,\theta ,{\sigma }^{2}\right)=\mathcal{N}\left(y|H\beta ,K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}\right),$`

the marginal log likelihood function is as follows:

`$\begin{array}{ll}\mathrm{log}P\left(y|X,\beta ,\theta ,{\sigma }^{2}\right)=\hfill & -\frac{1}{2}{\left(y-H\beta \right)}^{T}{\left[K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H\beta \right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}|.\hfill \end{array}$`

where $H$ is the vector of explicit basis functions, and $K\left(X,X|\theta \right)$ is the covariance function matrix (for more information, see Gaussian Process Regression Models).

To estimate the parameters, the software first computes $\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right)$, which maximizes the log likelihood function with respect to $\beta$ for given $\theta$ and ${\sigma }^{2}$. It then uses this estimate to compute the $\beta$-profiled likelihood:

`$\mathrm{log}\left\{P\left(y|X,\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right),\theta ,{\sigma }^{2}\right)\right\}.$`

The estimate of $\beta$ for given $\theta$, and ${\sigma }^{2}$ is

Then, the $\beta$-profiled log likelihood is given by

`$\begin{array}{ll}\mathrm{log}P\left(y|X,\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right),\theta ,{\sigma }^{2}\right)=\hfill & -\frac{1}{2}{\left(y-H\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right)\right)}^{T}{\left[K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right)\right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}|\hfill \end{array}$`

The software then maximizes the $\beta$-profiled log-likelihood over $\theta$, and ${\sigma }^{2}$ to find their estimates.

### Prediction

Making probabilistic predictions from a GPR model with known parameters requires the density $P\left({y}_{new}|y,X,{x}_{new}\right)$. Using the definition of conditional probabilities, one can write:

`$P\left({y}_{new}|y,X,{x}_{new}\right)=\frac{P\left({y}_{new},y|X,{x}_{new}\right)}{P\left(y|X,{x}_{new}\right)}.$`

To find the joint density in the numerator, it is necessary to introduce the latent variables ${f}_{new}$ and $f$ corresponding to ${y}_{new}$, and $y$, respectively. Then, it is possible to use the joint distribution for ${y}_{new}$, $y$, ${f}_{new}$, and $f$ to compute $P\left({y}_{new},y|X,{x}_{new}\right)$:

`$\begin{array}{l}\begin{array}{ll}P\left({y}_{new},y|X,{x}_{new}\right)\hfill & =\int \int P\left({y}_{new},y,{f}_{new},f|X,{x}_{new}\right)dfd{f}_{new}\hfill \\ \hfill & =\int \int P\left({y}_{new},y|{f}_{new},f,X,{x}_{new}\right)P\left({f}_{new},f|X,{x}_{new}\right)dfd{f}_{new}.\hfill \end{array}\\ \text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{ }\text{\hspace{0.17em}}\text{\hspace{0.17em}}\end{array}$`

Gaussian process models assume that each response ${y}_{i}$ only depends on the corresponding latent variable ${f}_{i}$ and the feature vector ${x}_{i}$. Writing $P\left({y}_{new},y|{f}_{new},f,X,{x}_{new}\right)$ as a product of conditional densities and based on this assumption produces:

`$\begin{array}{l}P\left({y}_{new},y|{f}_{new},f,X,{x}_{new}\right)=P\left({y}_{new}|{f}_{new},{x}_{new}\right)\prod _{i=1}^{n}P\left({y}_{i}|f\left({x}_{i}\right),{x}_{i}\right)\hfill \end{array}.$`

After integrating with respect to ${y}_{new}$, the result only depends on $f$ and $X$:

`$\begin{array}{l}P\left(y|f,X\right)=\prod _{i=1}^{n}P\left({y}_{i}|{f}_{i},{x}_{i}\right)=\prod _{i=1}^{n}\mathcal{N}\left({y}_{i}{|h\left({x}_{i}\right)}^{T}\beta +{f}_{i},{\sigma }^{2}\right)\hfill \end{array}.$`

Hence,

Again using the definition of conditional probabilities,

`$P\left({f}_{new},f|X,{x}_{new}\right)=P\left({f}_{new}|f,X,{x}_{new}\right)*P\left(f|X,{x}_{new}\right),$`

it is possible to write $P\left({y}_{new},y|X,{x}_{new}\right)$ as follows:

Using the facts that

`$P\left(f|X,{x}_{new}\right)=P\left(f|X\right)$`

and

`$P\left(y|f,X\right)P\left(f|X\right)=P\left(y,f|X\right)=P\left(f|y,X\right)P\left(y|X\right),$`

one can rewrite $P\left({y}_{new},y|X,{x}_{new}\right)$ as follows:

It is also possible to show that

`$P\left(y|X,{x}_{new}\right)=P\left(y|X\right).$`

Hence, the required density $P\left({y}_{new}|y,X,{x}_{new}\right)$ is:

It can be shown that

`$\left(1\right)\text{ }P\left({y}_{new}|{f}_{new},{x}_{new}\right)=\mathcal{N}\left({y}_{new}|h{\left({x}_{new}\right)}^{T}\beta +{f}_{new},{\sigma }_{new}^{2}\right)$`
`$\left(2\right)\text{ }P\left(f|y,X\right)=\mathcal{N}\left(f|\frac{1}{{\sigma }^{2}}{\left(\frac{{I}_{n}}{{\sigma }^{2}}+K{\left(X,X\right)}^{-1}\right)}^{-1}\left(y-H\beta \right),{\left(\frac{{I}_{n}}{{\sigma }^{2}}+K{\left(X,X\right)}^{-1}\right)}^{-1}\right)$`

After the integration and required algebra, the density of the new response ${y}_{new}$ at a new point ${x}_{new}$, given $y$, $X$ is found as

`$P\left({y}_{new}|y,X,{x}_{new}\right)=\mathcal{N}\left({y}_{new}|h{\left({x}_{new}\right)}^{T}\beta +\mu ,{\sigma }_{new}^{2}+\Sigma \right),$`

where

`$\mu =K\left({x}_{new}^{T},X\right)\underset{\alpha }{\underbrace{{\left(K\left(X,X\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}\left(y-H\beta \right)}}$`

and

`$\Sigma =k\left({x}_{new},{x}_{new}\right)-K\left({x}_{new}^{T},X\right){\left(K\left(X,X\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}K\left(X,{x}_{new}^{T}\right).$`

The expected value of prediction ${y}_{new}$ at a new point ${x}_{new}$ given $y$, $X$, and parameters $\beta$, $\theta$, and ${\sigma }^{2}$ is

where

`$\alpha ={\left(K\left(X,X|\theta \right)+{\sigma }^{2}{I}_{n}\right)}^{-1}\left(y-H\beta \right).$`

### Computational Complexity of Exact Parameter Estimation and Prediction

Training a GPR model with the exact method (when `FitMethod` is `'Exact'`) requires the inversion of an n-by-n kernel matrix $K\left(X,X\right)$. The memory requirement for this step scales as O(n^2) since $K\left(X,X\right)$ must be stored in memory. One evaluation of $\mathrm{log}P\left(y|X\right)$ scales as O(n^3). Therefore, the computational complexity is O(k*n^3), where k is the number of function evaluations needed for maximization and n is the number of observations.

Making predictions on new data involves the computation of $\stackrel{^}{\alpha }$. If prediction intervals are desired, this step could also involve the computation and storage of the Cholesky factor of $\left(K\left(X,X\right)+{\sigma }^{2}{I}_{n}\right)$ for later use. The computational complexity of this step using the direct computation of $\stackrel{^}{\alpha }$ is O(n^3) and the memory requirement is O(n^2).

Hence, for large n, estimation of parameters or computing predictions can be very expensive. The approximation methods usually involve rearranging the computation so as to avoid the inversion of an n-by-n matrix. For the available approximation methods, please see the related links at the bottom of the page.

## References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.