# 15 Gaussian Processes for Machine Learning

Examples concerning the [`sklearn.gaussian_process`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.gaussian_process) module.

## 15.1 Illustration of Gaussian Process Classification (GPC) on the XOR Dataset

This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains better results because the class boundaries are linear and coincide with the coordinate axes. In general, however, stationary kernels often obtain better results.
```python
print(__doc__)

# Authors: Jan Hendrik Metzen
#
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, DotProduct

xx, yy = np.meshgrid(np.linspace(-3, 3, 50),
                     np.linspace(-3, 3, 50))
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# fit the model
plt.figure(figsize=(10, 5))
kernels = [1.0 * RBF(length_scale=1.0), 1.0 * DotProduct(sigma_0=1.0)**2]
for i, kernel in enumerate(kernels):
    clf = GaussianProcessClassifier(kernel=kernel, warm_start=True).fit(X, Y)

    # plot the decision function for each datapoint on the grid
    Z = clf.predict_proba(np.vstack((xx.ravel(), yy.ravel())).T)[:, 1]
    Z = Z.reshape(xx.shape)

    plt.subplot(1, 2, i + 1)
    image = plt.imshow(Z, interpolation='nearest',
                       extent=(xx.min(), xx.max(), yy.min(), yy.max()),
                       aspect='auto', origin='lower', cmap=plt.cm.PuOr_r)
    contours = plt.contour(xx, yy, Z, levels=[0.5], linewidths=2,
                           colors=['k'])
    plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
                edgecolors=(0, 0, 0))
    plt.xticks(())
    plt.yticks(())
    plt.axis([-3, 3, -3, 3])
    plt.colorbar(image)
    plt.title("%s\n Log-Marginal-Likelihood:%.3f"
              % (clf.kernel_, clf.log_marginal_likelihood(clf.kernel_.theta)),
              fontsize=12)

plt.tight_layout()
plt.show()
```

**Total running time of the script:** (0 minutes 0.569 seconds)

[`Download Python source code: plot_gpc_xor.py`](https://scikit-learn.org/stable/_downloads/1edbd653105d166b25f5b13a7f98a577/plot_gpc_xor.py)

[`Download Jupyter notebook: plot_gpc_xor.ipynb`](https://scikit-learn.org/stable/_downloads/51ae591a054d4b4f05fe0661f913d3fa/plot_gpc_xor.ipynb)

## 15.2 Gaussian Process Classification (GPC) on the iris Dataset

This example illustrates the predicted probability of GPC for an isotropic and an anisotropic RBF kernel on a two-dimensional version of the iris dataset. The anisotropic RBF kernel obtains a slightly higher log-marginal-likelihood by assigning different length-scales to the two feature dimensions.
```python
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = np.array(iris.target, dtype=int)

h = .02  # step size in the mesh

kernel = 1.0 * RBF([1.0])
gpc_rbf_isotropic = GaussianProcessClassifier(kernel=kernel).fit(X, y)
kernel = 1.0 * RBF([1.0, 1.0])
gpc_rbf_anisotropic = GaussianProcessClassifier(kernel=kernel).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

titles = ["Isotropic RBF", "Anisotropic RBF"]
plt.figure(figsize=(10, 5))
for i, clf in enumerate((gpc_rbf_isotropic, gpc_rbf_anisotropic)):
    # Plot the predicted probabilities. For that, we will assign a color to
    # each point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(1, 2, i + 1)

    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape((xx.shape[0], xx.shape[1], 3))
    plt.imshow(Z, extent=(x_min, x_max, y_min, y_max), origin="lower")

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=np.array(["r", "g", "b"])[y],
                edgecolors=(0, 0, 0))
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title("%s, LML: %.3f" %
              (titles[i], clf.log_marginal_likelihood(clf.kernel_.theta)))

plt.tight_layout()
plt.show()
```

**Total running time of the script:** (0 minutes 3.967 seconds)

[`Download Python source code: plot_gpc_iris.py`](https://scikit-learn.org/stable/_downloads/482816438fdebd2ee58cf1097e5aa519/plot_gpc_iris.py)

[`Download Jupyter notebook: plot_gpc_iris.ipynb`](https://scikit-learn.org/stable/_downloads/0314576f362d54c36e37904494f4c7d9/plot_gpc_iris.ipynb)

## 15.3 Comparison of Kernel Ridge Regression and Gaussian Process Regression

Both kernel ridge regression (KRR) and Gaussian process regression (GPR) learn a target function by employing internally the "kernel trick". KRR learns a linear function in the space induced by the respective kernel, which corresponds to a non-linear function in the original space; the linear function is chosen based on the mean-squared error loss with ridge regularization. GPR uses the kernel to define the covariance of a prior distribution over the target functions and uses the observed training data to define a likelihood function. Based on Bayes theorem, a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction.

A major difference is that GPR can choose the kernel's hyperparameters based on gradient ascent on the marginal likelihood function, while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error loss). A further difference is that GPR learns a generative, probabilistic model of the target function and can thus provide meaningful confidence intervals and posterior samples along with the predictions, while KRR only provides predictions.

This example illustrates both methods on an artificial dataset, which consists of a sinusoidal target function and strong noise. The figure compares the learned models of KRR and GPR based on an ExpSineSquared kernel, which is suited for learning periodic functions. The kernel's hyperparameters control the smoothness (length scale, `l`) and periodicity (`p`) of the kernel. Moreover, the noise level of the data is learned explicitly by GPR via an additional WhiteKernel component in the kernel and by the regularization parameter alpha of KRR.

The figure shows that both methods learn reasonable models of the target function. GPR correctly identifies the periodicity of the function to be roughly 2\*pi (6.28), while KRR chooses the doubled periodicity 4\*pi. Besides that, GPR provides reasonable confidence bounds on the prediction, which are not available for KRR. A major difference between the two methods is the time required for fitting and predicting: while fitting KRR is fast in principle, the grid search for hyperparameter optimization scales exponentially with the number of hyperparameters (the "curse of dimensionality"). The gradient-based optimization of the parameters in GPR does not suffer from this exponential scaling and is thus considerably faster on this example with a three-dimensional hyperparameter space. The time for predicting is similar; however, generating the variance of the predictive distribution of GPR takes considerably longer than just predicting the mean.
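The gradient that GPR climbs is exposed directly on the fitted estimator. The following minimal sketch is not part of the original example (its toy data and kernel choice are assumptions); it only shows that `log_marginal_likelihood` can return the gradient with respect to the log-transformed hyperparameters `theta`, which is what makes gradient ascent possible where KRR must fall back to a grid:

```python
# Minimal sketch (assumed toy data): inspect the LML value and gradient
# that GPR's internal optimizer uses for gradient ascent.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

rng = np.random.RandomState(0)
X = 15 * rng.rand(30, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(30)

gpr = GaussianProcessRegressor(
    kernel=ExpSineSquared(1.0, 5.0) + WhiteKernel(1e-1)).fit(X, y)

# LML and its gradient w.r.t. the log-transformed hyperparameters theta
lml, grad = gpr.log_marginal_likelihood(gpr.kernel_.theta, eval_gradient=True)
print("LML: %.3f, gradient: %s" % (lml, grad))
```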
```python
Time for KRR fitting: 4.593
Time for GPR fitting: 0.105
Time for KRR prediction: 0.054
Time for GPR prediction: 0.078
Time for GPR prediction with standard-deviation: 0.084
```

```python
print(__doc__)

# Authors: Jan Hendrik Metzen
# License: BSD 3 clause

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared

rng = np.random.RandomState(0)

# Generate sample data
X = 15 * rng.rand(100, 1)
y = np.sin(X).ravel()
y += 3 * (0.5 - rng.rand(X.shape[0]))  # add noise

# Fit KernelRidge with parameter selection based on 5-fold cross validation
param_grid = {"alpha": [1e0, 1e-1, 1e-2, 1e-3],
              "kernel": [ExpSineSquared(l, p)
                         for l in np.logspace(-2, 2, 10)
                         for p in np.logspace(0, 2, 10)]}
kr = GridSearchCV(KernelRidge(), param_grid=param_grid)
stime = time.time()
kr.fit(X, y)
print("Time for KRR fitting: %.3f" % (time.time() - stime))

gp_kernel = ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1)) \
    + WhiteKernel(1e-1)
gpr = GaussianProcessRegressor(kernel=gp_kernel)
stime = time.time()
gpr.fit(X, y)
print("Time for GPR fitting: %.3f" % (time.time() - stime))

# Predict using kernel ridge
X_plot = np.linspace(0, 20, 10000)[:, None]
stime = time.time()
y_kr = kr.predict(X_plot)
print("Time for KRR prediction: %.3f" % (time.time() - stime))

# Predict using gaussian process regressor
stime = time.time()
y_gpr = gpr.predict(X_plot, return_std=False)
print("Time for GPR prediction: %.3f" % (time.time() - stime))

stime = time.time()
y_gpr, y_std = gpr.predict(X_plot, return_std=True)
print("Time for GPR prediction with standard-deviation: %.3f"
      % (time.time() - stime))

# Plot results
plt.figure(figsize=(10, 5))
lw = 2
plt.scatter(X, y, c='k', label='data')
plt.plot(X_plot, np.sin(X_plot), color='navy', lw=lw, label='True')
plt.plot(X_plot, y_kr, color='turquoise', lw=lw,
         label='KRR (%s)' % kr.best_params_)
plt.plot(X_plot, y_gpr, color='darkorange', lw=lw,
         label='GPR (%s)' % gpr.kernel_)
plt.fill_between(X_plot[:, 0], y_gpr - y_std, y_gpr + y_std,
                 color='darkorange', alpha=0.2)
plt.xlabel('data')
plt.ylabel('target')
plt.xlim(0, 20)
plt.ylim(-4, 4)
plt.title('GPR versus Kernel Ridge')
plt.legend(loc="best", scatterpoints=1, prop={'size': 8})
plt.show()
```

**Total running time of the script:** (0 minutes 5.065 seconds)

[`Download Python source code: plot_compare_gpr_krr.py`](https://scikit-learn.org/stable/_downloads/0e50e21d51dd40e8e6126d72713b794d/plot_compare_gpr_krr.py)

[`Download Jupyter notebook: plot_compare_gpr_krr.ipynb`](https://scikit-learn.org/stable/_downloads/afcbb9da0f65f832f7e87652c6007641/plot_compare_gpr_krr.ipynb)

## 15.4 Illustration of Prior and Posterior Gaussian Process for Different Kernels

This example illustrates the prior and posterior of a GPR with different kernels. Mean, standard deviation, and 10 samples are shown for both the prior and the posterior.
```python
/home/circleci/project/sklearn/gaussian_process/_gpr.py:504: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)
/home/circleci/project/sklearn/gaussian_process/_gpr.py:370: UserWarning: Predicted variances smaller than 0. Setting those variances to 0.
  warnings.warn("Predicted variances smaller than 0. "
```

```python
print(__doc__)

# Authors: Jan Hendrik Metzen
#
# License: BSD 3 clause

import numpy as np

from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
                                              ExpSineSquared, DotProduct,
                                              ConstantKernel)

kernels = [1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0)),
           1.0 * RationalQuadratic(length_scale=1.0, alpha=0.1),
           1.0 * ExpSineSquared(length_scale=1.0, periodicity=3.0,
                                length_scale_bounds=(0.1, 10.0),
                                periodicity_bounds=(1.0, 10.0)),
           ConstantKernel(0.1, (0.01, 10.0))
               * (DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
           1.0 * Matern(length_scale=1.0, length_scale_bounds=(1e-1, 10.0),
                        nu=1.5)]

for kernel in kernels:
    # Specify Gaussian Process
    gp = GaussianProcessRegressor(kernel=kernel)

    # Plot prior
    plt.figure(figsize=(8, 8))
    plt.subplot(2, 1, 1)
    X_ = np.linspace(0, 5, 100)
    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
    plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
    plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
                     alpha=0.2, color='k')
    y_samples = gp.sample_y(X_[:, np.newaxis], 10)
    plt.plot(X_, y_samples, lw=1)
    plt.xlim(0, 5)
    plt.ylim(-3, 3)
    plt.title("Prior (kernel: %s)" % kernel, fontsize=12)

    # Generate data and fit GP
    rng = np.random.RandomState(4)
    X = rng.uniform(0, 5, 10)[:, np.newaxis]
    y = np.sin((X[:, 0] - 2.5) ** 2)
    gp.fit(X, y)

    # Plot posterior
    plt.subplot(2, 1, 2)
    X_ = np.linspace(0, 5, 100)
    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
    plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
    plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
                     alpha=0.2, color='k')

    y_samples = gp.sample_y(X_[:, np.newaxis], 10)
    plt.plot(X_, y_samples, lw=1)
    plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
    plt.xlim(0, 5)
    plt.ylim(-3, 3)
    plt.title("Posterior (kernel: %s)\n Log-Likelihood: %.3f"
              % (gp.kernel_, gp.log_marginal_likelihood(gp.kernel_.theta)),
              fontsize=12)
    plt.tight_layout()

plt.show()
```

**Total running time of the script:** (0 minutes 1.299 seconds)

[`Download Python source code: plot_gpr_prior_posterior.py`](https://scikit-learn.org/stable/_downloads/581ff73109cc436844bb4417da6078e8/plot_gpr_prior_posterior.py)

[`Download Jupyter notebook: plot_gpr_prior_posterior.ipynb`](https://scikit-learn.org/stable/_downloads/c7e75363a4b293ce9f52f6a681aae602/plot_gpr_prior_posterior.ipynb)

## 15.5 Iso-Probability Lines for Gaussian Process Classification (GPC)

A two-dimensional classification example showing iso-probability lines for the predicted probabilities.
```python
Learned kernel: 0.0256**2 * DotProduct(sigma_0=5.72) ** 2
```

```python
print(__doc__)

# Author: Vincent Dubourg
# Adapted to GaussianProcessClassifier:
#         Jan Hendrik Metzen
# License: BSD 3 clause

import numpy as np

from matplotlib import pyplot as plt
from matplotlib import cm

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import DotProduct, ConstantKernel as C

# A few constants
lim = 8


def g(x):
    """The function to predict (classification will then consist in
    predicting whether g(x) <= 0 or not)"""
    return 5. - x[:, 1] - .5 * x[:, 0] ** 2.


# Design of experiments
X = np.array([[-4.61611719, -6.00099547],
              [4.10469096, 5.32782448],
              [0.00000000, -0.50000000],
              [-6.17289014, -4.6984743],
              [1.3109306, -6.93271427],
              [-5.03823144, 3.10584743],
              [-2.87600388, 6.74310541],
              [5.21301203, 4.26386883]])

# Observations
y = np.array(g(X) > 0, dtype=int)

# Instantiate and fit Gaussian Process Model
kernel = C(0.1, (1e-5, np.inf)) * DotProduct(sigma_0=0.1) ** 2
gp = GaussianProcessClassifier(kernel=kernel)
gp.fit(X, y)
print("Learned kernel: %s " % gp.kernel_)

# Evaluate real function and the predicted probability
res = 50
x1, x2 = np.meshgrid(np.linspace(- lim, lim, res),
                     np.linspace(- lim, lim, res))
xx = np.vstack([x1.reshape(x1.size), x2.reshape(x2.size)]).T

y_true = g(xx)
y_prob = gp.predict_proba(xx)[:, 1]
y_true = y_true.reshape((res, res))
y_prob = y_prob.reshape((res, res))

# Plot the probabilistic classification iso-values
fig = plt.figure(1)
ax = fig.gca()
ax.axes.set_aspect('equal')
plt.xticks([])
plt.yticks([])
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

cax = plt.imshow(y_prob, cmap=cm.gray_r, alpha=0.8,
                 extent=(-lim, lim, -lim, lim))
norm = plt.matplotlib.colors.Normalize(vmin=0., vmax=0.9)
cb = plt.colorbar(cax, ticks=[0., 0.2, 0.4, 0.6, 0.8, 1.], norm=norm)
cb.set_label(r'${\rm \mathbb{P}}\left[\widehat{G}(\mathbf{x}) \leq 0\right]$')
plt.clim(0, 1)

plt.plot(X[y <= 0, 0], X[y <= 0, 1], 'r.', markersize=12)
plt.plot(X[y > 0, 0], X[y > 0, 1], 'b.', markersize=12)

plt.contour(x1, x2, y_true, [0.], colors='k', linestyles='dashdot')

cs = plt.contour(x1, x2, y_prob, [0.666], colors='b', linestyles='solid')
plt.clabel(cs, fontsize=11)

cs = plt.contour(x1, x2, y_prob, [0.5], colors='k', linestyles='dashed')
plt.clabel(cs, fontsize=11)

cs = plt.contour(x1, x2, y_prob, [0.334], colors='r', linestyles='solid')
plt.clabel(cs, fontsize=11)

plt.show()
```

**Total running time of the script:** (0 minutes 0.201 seconds)

[`Download Python source code: plot_gpc_isoprobability.py`](https://scikit-learn.org/stable/_downloads/2f57d8a0d076d76a7a313636b4239a95/plot_gpc_isoprobability.py)

[`Download Jupyter notebook: plot_gpc_isoprobability.ipynb`](https://scikit-learn.org/stable/_downloads/97bb7d5b2d2ee600130fc2d23720eecf/plot_gpc_isoprobability.ipynb)

## 15.6 Probabilistic Predictions with Gaussian Process Classification (GPC)

This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparameters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the hyperparameters corresponding to the maximum log-marginal-likelihood (LML).

While the hyperparameters chosen by optimizing LML have a considerably larger LML, they perform slightly worse according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the class boundaries (which is bad). This undesirable effect is caused by the Laplace approximation used internally by GPC.

The second figure shows the log-marginal-likelihood for different choices of the kernel's hyperparameters, highlighting with black dots the two choices of hyperparameters used in the first figure.
```python
Log Marginal Likelihood (initial): -17.598
Log Marginal Likelihood (optimized): -3.875
Accuracy: 1.000 (initial) 1.000 (optimized)
Log-loss: 0.214 (initial) 0.319 (optimized)
```

```python
print(__doc__)

# Authors: Jan Hendrik Metzen
#
# License: BSD 3 clause

import numpy as np

from matplotlib import pyplot as plt

from sklearn.metrics import accuracy_score, log_loss
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF


# Generate data
train_size = 50
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 100)[:, np.newaxis]
y = np.array(X[:, 0] > 2.5, dtype=int)

# Specify Gaussian Processes with fixed and optimized hyperparameters
gp_fix = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                   optimizer=None)
gp_fix.fit(X[:train_size], y[:train_size])

gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp_opt.fit(X[:train_size], y[:train_size])

print("Log Marginal Likelihood (initial): %.3f"
      % gp_fix.log_marginal_likelihood(gp_fix.kernel_.theta))
print("Log Marginal Likelihood (optimized): %.3f"
      % gp_opt.log_marginal_likelihood(gp_opt.kernel_.theta))

print("Accuracy: %.3f (initial) %.3f (optimized)"
      % (accuracy_score(y[:train_size], gp_fix.predict(X[:train_size])),
         accuracy_score(y[:train_size], gp_opt.predict(X[:train_size]))))
print("Log-loss: %.3f (initial) %.3f (optimized)"
      % (log_loss(y[:train_size], gp_fix.predict_proba(X[:train_size])[:, 1]),
         log_loss(y[:train_size], gp_opt.predict_proba(X[:train_size])[:, 1])))

# Plot posteriors
plt.figure()
plt.scatter(X[:train_size, 0], y[:train_size], c='k', label="Train data",
            edgecolors=(0, 0, 0))
plt.scatter(X[train_size:, 0], y[train_size:], c='g', label="Test data",
            edgecolors=(0, 0, 0))
X_ = np.linspace(0, 5, 100)
plt.plot(X_, gp_fix.predict_proba(X_[:, np.newaxis])[:, 1], 'r',
         label="Initial kernel: %s" % gp_fix.kernel_)
plt.plot(X_, gp_opt.predict_proba(X_[:, np.newaxis])[:, 1], 'b',
         label="Optimized kernel: %s" % gp_opt.kernel_)
plt.xlabel("Feature")
plt.ylabel("Class 1 probability")
plt.xlim(0, 5)
plt.ylim(-0.25, 1.5)
plt.legend(loc="best")

# Plot LML landscape
plt.figure()
theta0 = np.logspace(0, 8, 30)
theta1 = np.logspace(-1, 1, 29)
Theta0, Theta1 = np.meshgrid(theta0, theta1)
LML = [[gp_opt.log_marginal_likelihood(np.log([Theta0[i, j], Theta1[i, j]]))
        for i in range(Theta0.shape[0])] for j in range(Theta0.shape[1])]
LML = np.array(LML).T
plt.plot(np.exp(gp_fix.kernel_.theta)[0], np.exp(gp_fix.kernel_.theta)[1],
         'ko', zorder=10)
plt.plot(np.exp(gp_opt.kernel_.theta)[0], np.exp(gp_opt.kernel_.theta)[1],
         'ko', zorder=10)
plt.pcolor(Theta0, Theta1, LML)
plt.xscale("log")
plt.yscale("log")
plt.colorbar()
plt.xlabel("Magnitude")
plt.ylabel("Length-scale")
plt.title("Log-marginal-likelihood")

plt.show()
```

**Total running time of the script:** (0 minutes 3.016 seconds)

[`Download Python source code: plot_gpc.py`](https://scikit-learn.org/stable/_downloads/28cc51eb795f5d3f1d2b277cec0fc093/plot_gpc.py)

[`Download Jupyter notebook: plot_gpc.ipynb`](https://scikit-learn.org/stable/_downloads/d81d90dfa14b996cbc250ffb328a3074/plot_gpc.ipynb)

## 15.7 Gaussian Process Regression (GPR) with Noise-Level Estimation

This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of the data. An illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of the LML. The first corresponds to a model with a high noise level and a large length-scale, which explains all variation in the data by noise. The second one has a smaller noise level and a shorter length-scale, which explains most of the variation by the noise-free functional relationship. The second model has a higher likelihood; however, depending on the initial values of the hyperparameters, the gradient-based optimization might also converge to the high-noise solution. It is therefore important to repeat the optimization several times for different initializations.
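scikit-learn supports such repeated optimization directly through the `n_restarts_optimizer` parameter of `GaussianProcessRegressor`. The sketch below is not part of the example script (its data mirror the example's setup but are an assumption here); the first run starts from the kernel's initial hyperparameters and each further restart starts from values sampled log-uniformly from the allowed bounds:

```python
# Sketch (assumed data): restart the LML optimization from several random
# initializations so it is less likely to end in the poorer local maximum.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])

kernel = 1.0 * RBF(length_scale=100.0) + WhiteKernel(noise_level=1)
# 1 run from the initial values + 9 runs from random initializations
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0,
                              n_restarts_optimizer=9).fit(X, y)
print(gp.kernel_)
print("LML: %.3f" % gp.log_marginal_likelihood(gp.kernel_.theta))
```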
```python
print(__doc__)

# Authors: Jan Hendrik Metzen
#
# License: BSD 3 clause

import numpy as np

from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])

# First run
plt.figure()
kernel = 1.0 * RBF(length_scale=100.0, length_scale_bounds=(1e-2, 1e3)) \
    + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(X, y)
X_ = np.linspace(0, 5, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
                 y_mean + np.sqrt(np.diag(y_cov)),
                 alpha=0.5, color='k')
plt.plot(X_, 0.5*np.sin(3*X_), 'r', lw=3, zorder=9)
plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
plt.title("Initial: %s\nOptimum: %s\nLog-Marginal-Likelihood: %s"
          % (kernel, gp.kernel_,
             gp.log_marginal_likelihood(gp.kernel_.theta)))
plt.tight_layout()

# Second run
plt.figure()
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
    + WhiteKernel(noise_level=1e-5, noise_level_bounds=(1e-10, 1e+1))
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(X, y)
X_ = np.linspace(0, 5, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
                 y_mean + np.sqrt(np.diag(y_cov)),
                 alpha=0.5, color='k')
plt.plot(X_, 0.5*np.sin(3*X_), 'r', lw=3, zorder=9)
plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
plt.title("Initial: %s\nOptimum: %s\nLog-Marginal-Likelihood: %s"
          % (kernel, gp.kernel_,
             gp.log_marginal_likelihood(gp.kernel_.theta)))
plt.tight_layout()

# Plot LML landscape
plt.figure()
theta0 = np.logspace(-2, 3, 49)
theta1 = np.logspace(-2, 0, 50)
Theta0, Theta1 = np.meshgrid(theta0, theta1)
LML = [[gp.log_marginal_likelihood(np.log([0.36, Theta0[i, j],
                                           Theta1[i, j]]))
        for i in range(Theta0.shape[0])] for j in range(Theta0.shape[1])]
LML = np.array(LML).T

vmin, vmax = (-LML).min(), (-LML).max()
vmax = 50
level = np.around(np.logspace(np.log10(vmin), np.log10(vmax), 50),
                  decimals=1)
plt.contour(Theta0, Theta1, -LML,
            levels=level, norm=LogNorm(vmin=vmin, vmax=vmax))
plt.colorbar()
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Length-scale")
plt.ylabel("Noise-level")
plt.title("Log-marginal-likelihood")
plt.tight_layout()

plt.show()
```

**Total running time of the script:** (0 minutes 3.520 seconds)

[`Download Python source code: plot_gpr_noisy.py`](https://scikit-learn.org/stable/_downloads/25c15021b0d52908c9174bd55e9fb68e/plot_gpr_noisy.py)

[`Download Jupyter notebook: plot_gpr_noisy.ipynb`](https://scikit-learn.org/stable/_downloads/5ad621a93d8c0c9d041443f21775d9c0/plot_gpr_noisy.ipynb)

## 15.8 Gaussian Process Regression: Basic Introductory Example

A simple one-dimensional regression example computed in two different ways:

1. A noise-free case
2. A noisy case with known noise-level per datapoint

In both cases, the kernel's parameters are estimated using the maximum likelihood principle.

The figures illustrate the interpolating property of the Gaussian Process model as well as its probabilistic nature in the form of a pointwise 95% confidence interval.

Note that the parameter `alpha` is applied as a Tikhonov regularization of the assumed covariance between the training points.
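The role of `alpha` can be seen in the closed-form posterior mean, where it is added to the diagonal of the training covariance matrix before inversion. The following sketch is illustrative only (the hand-rolled RBF kernel and the sample points are assumptions, not the example's code):

```python
# Sketch: alpha acts as Tikhonov (diagonal) regularization of the
# training covariance matrix in the GP posterior mean,
#     mean(x*) = k(x*, X) @ inv(K(X, X) + alpha * I) @ y
import numpy as np


def rbf(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)


X = np.array([[1.], [3.], [5.], [6.], [7.], [8.]])
y = (X * np.sin(X)).ravel()
alpha = 1e-2                               # assumed per-point noise term
x_star = np.array([[4.]])

K = rbf(X, X) + alpha * np.eye(len(X))     # regularized Gram matrix
mean = rbf(x_star, X) @ np.linalg.solve(K, y)
print(mean)
```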
```python
print(__doc__)

# Author: Vincent Dubourg
#         Jake Vanderplas
#         Jan Hendrik Metzen
# License: BSD 3 clause

import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

np.random.seed(1)


def f(x):
    """The function to predict."""
    return x * np.sin(x)

# ----------------------------------------------------------------------
#  First the noiseless case
X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T

# Observations
y = f(X).ravel()

# Mesh the input space for evaluations of the real function, the prediction
# and its MSE
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

# Instantiate a Gaussian Process model
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x, f(x), 'r:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'r.', markersize=10, label='Observations')
plt.plot(x, y_pred, 'b-', label='Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')

# ----------------------------------------------------------------------
# now the noisy case
X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T

# Observations and noise
y = f(X).ravel()
dy = 0.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

# Instantiate a Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, alpha=dy ** 2,
                              n_restarts_optimizer=10)

# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)

# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)

# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x, f(x), 'r:', label=r'$f(x) = x\,\sin(x)$')
plt.errorbar(X.ravel(), y, dy, fmt='r.', markersize=10, label='Observations')
plt.plot(x, y_pred, 'b-', label='Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                        (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')

plt.show()
```

**Total running time of the script:** (0 minutes 0.478 seconds)

[`Download Python source code: plot_gpr_noisy_targets.py`](https://scikit-learn.org/stable/_downloads/49797ed79238effefc5ba8bad49bb25c/plot_gpr_noisy_targets.py)

[`Download Jupyter notebook: plot_gpr_noisy_targets.ipynb`](https://scikit-learn.org/stable/_downloads/053f85f102c07abae7dc20a87b112911/plot_gpr_noisy_targets.ipynb)

## 15.9 Gaussian Process Regression (GPR) on Mauna Loa CO2 Data

This example is based on Section 5.4.3 of "Gaussian Processes for Machine Learning" [RW2006]. It illustrates an example of complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consist of the monthly average atmospheric CO2 concentrations (in parts per million by volume, ppmv) collected at the Mauna Loa Observatory in Hawaii between 1958 and 2001. The objective is to model the CO2 concentration as a function of the time t.

The kernel is composed of several terms that are responsible for explaining different properties of the signal:

- A long-term, smooth rising trend, explained by an RBF kernel. The RBF kernel with a large length-scale enforces this component to be smooth; it is not enforced that the trend is rising, which leaves this choice to the GP. The length-scale and amplitude are free hyperparameters.
- A seasonal component, explained by the periodic ExpSineSquared kernel with a fixed periodicity of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this RBF component controls the decay time and is a further free parameter.
- Smaller, medium-term irregularities, explained by a RationalQuadratic kernel component, whose length-scale and alpha parameter (which determines the diffuseness of the length-scales) are to be determined. According to [RW2006](https://sklearn.apachecn.org/docs/master/8.html#rw2006), these irregularities can be explained better by a RationalQuadratic kernel than by an RBF kernel, probably because it can accommodate several length-scales.
- A "noise" term, consisting of an RBF kernel contribution, which shall explain the correlated noise components such as local weather phenomena, and a WhiteKernel contribution for the white noise. The relative amplitudes and the RBF's length-scale are further free parameters.

Maximizing the log-marginal-likelihood after subtracting the target's mean yields the following kernel with an LML of -83.214:

```python
34.4**2 * RBF(length_scale=41.8)
+ 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44,
                                                   periodicity=1)
+ 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
+ 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)
```

Thus, most of the target signal (34.4 ppm) is explained by a long-term rising trend (length-scale 41.8 years). The periodic component has an amplitude of 3.27 ppm, a decay time of 180 years, and a length-scale of 1.44. The long decay time indicates a component that is locally very close to a periodic seasonal pattern. The correlated noise has an amplitude of 0.197 ppm with a length-scale of 0.138 years, and the white-noise contribution is 0.197 ppm. Thus, the overall noise level is very small, indicating that the data can be very well explained by the model. The figure also shows that the model makes very confident predictions until around 2015.
```python
GPML kernel: 66**2 * RBF(length_scale=67) + 2.4**2 * RBF(length_scale=90) * ExpSineSquared(length_scale=1.3, periodicity=1) + 0.66**2 * RationalQuadratic(alpha=0.78, length_scale=1.2) + 0.18**2 * RBF(length_scale=0.134) + WhiteKernel(noise_level=0.0361)
Log-marginal-likelihood: 155.006

Learned kernel: 2.59**2 * RBF(length_scale=51) + 0.257**2 * RBF(length_scale=137) * ExpSineSquared(length_scale=2.15, periodicity=1) + 0.118**2 * RationalQuadratic(alpha=2.32, length_scale=70.6) + 0.03**2 * RBF(length_scale=1.01) + WhiteKernel(noise_level=0.001)
Log-marginal-likelihood: 1161.609
```

```python
# Authors: Jan Hendrik Metzen
#
# License: BSD 3 clause

import numpy as np

from matplotlib import pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels \
    import RBF, WhiteKernel, RationalQuadratic, ExpSineSquared

print(__doc__)


def load_mauna_loa_atmospheric_co2():
    ml_data = fetch_openml(data_id=41187)
    months = []
    ppmv_sums = []
    counts = []

    y = ml_data.data[:, 0]
    m = ml_data.data[:, 1]
    month_float = y + (m - 1) / 12
    ppmvs = ml_data.target

    for month, ppmv in zip(month_float, ppmvs):
        if not months or month != months[-1]:
            months.append(month)
            ppmv_sums.append(ppmv)
            counts.append(1)
        else:
            # aggregate monthly sum to produce average
            ppmv_sums[-1] += ppmv
            counts[-1] += 1

    months = np.asarray(months).reshape(-1, 1)
    avg_ppmvs = np.asarray(ppmv_sums) / counts
    return months, avg_ppmvs


X, y = load_mauna_loa_atmospheric_co2()

# Kernel with parameters given in GPML book
k1 = 66.0**2 * RBF(length_scale=67.0)  # long term smooth rising trend
k2 = 2.4**2 * RBF(length_scale=90.0) \
    * ExpSineSquared(length_scale=1.3, periodicity=1.0)  # seasonal component
# medium term irregularity
k3 = 0.66**2 \
    * RationalQuadratic(length_scale=1.2, alpha=0.78)
k4 = 0.18**2 * RBF(length_scale=0.134) \
    + WhiteKernel(noise_level=0.19**2)  # noise terms
kernel_gpml = k1 + k2 + k3 + k4

gp = GaussianProcessRegressor(kernel=kernel_gpml, alpha=0,
                              optimizer=None, normalize_y=True)
gp.fit(X, y)

print("GPML kernel: %s" % gp.kernel_)
print("Log-marginal-likelihood: %.3f"
      % gp.log_marginal_likelihood(gp.kernel_.theta))

# Kernel with optimized parameters
k1 = 50.0**2 * RBF(length_scale=50.0)  # long term smooth rising trend
k2 = 2.0**2 * RBF(length_scale=100.0) \
    * ExpSineSquared(length_scale=1.0, periodicity=1.0,
                     periodicity_bounds="fixed")  # seasonal component
# medium term irregularities
k3 = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
k4 = 0.1**2 * RBF(length_scale=0.1) \
    + WhiteKernel(noise_level=0.1**2,
                  noise_level_bounds=(1e-3, np.inf))  # noise terms
kernel = k1 + k2 + k3 + k4

gp = GaussianProcessRegressor(kernel=kernel, alpha=0, normalize_y=True)
gp.fit(X, y)

print("\nLearned kernel: %s" % gp.kernel_)
print("Log-marginal-likelihood: %.3f"
      % gp.log_marginal_likelihood(gp.kernel_.theta))

X_ = np.linspace(X.min(), X.max() + 30, 1000)[:, np.newaxis]
y_pred, y_std = gp.predict(X_, return_std=True)

# Illustration
plt.scatter(X, y, c='k')
plt.plot(X_, y_pred)
plt.fill_between(X_[:, 0], y_pred - y_std, y_pred + y_std,
                 alpha=0.5, color='k')
plt.xlim(X_.min(), X_.max())
plt.xlabel("Year")
plt.ylabel(r"CO$_2$ in ppm")
plt.title(r"Atmospheric CO$_2$ concentration at Mauna Loa")
plt.tight_layout()
plt.show()
```

**Total running time of the script:** (0 minutes 9.429 seconds)

[`Download Python source code: plot_gpr_co2.py`](https://scikit-learn.org/stable/_downloads/22924b84e8589b1ceef00d12afcbd40e/plot_gpr_co2.py)

[`Download Jupyter notebook: plot_gpr_co2.ipynb`](https://scikit-learn.org/stable/_downloads/539cd4c347051b3af27de0cae21eb231/plot_gpr_co2.ipynb)

## 15.10 Gaussian Processes on Discrete Data Structures

This example illustrates the use of Gaussian processes for regression and classification tasks on data that are not in fixed-length feature vector form. This is achieved through the use of kernel functions that operate directly on discrete structures such as variable-length sequences, trees, and graphs.

Specifically, the input variables here are gene sequences stored as variable-length strings consisting of the letters 'A', 'T', 'C', and 'G', while the output variables are floating-point numbers in the regression task and True/False labels in the classification task.

A kernel between the gene sequences is defined using R-convolution [[1]](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_on_structured_data.html#id2) by integrating a binary letter-wise kernel over all pairs of letters among a pair of strings.

This example generates three figures.

In the first figure, we visualize the value of the kernel, i.e. the similarity of the sequences, using a colormap. Brighter color here indicates higher similarity.

In the second figure, we show some regression results on a dataset of 6 sequences. Here we use the 1st, 2nd, 4th, and 5th sequences as the training set to make predictions on the 3rd and 6th sequences.
In the third figure, we demonstrate a classification model by training on 6 sequences and making predictions on another 5 sequences. The ground truth here is simply whether there is at least one 'A' in the sequence. The model makes four correct classifications and fails on one.

> [1](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_on_structured_data.html#id1) Haussler, D. (1999). Convolution kernels on discrete structures (Vol. 646). Technical report, Department of Computer Science, University of California at Santa Cruz.
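As a quick sanity check of this definition, the letter-pair sum for two short strings can be evaluated by hand (the strings and the 0.5 baseline similarity are arbitrary choices mirroring the default in the script below):

```python
# R-convolution kernel between 'AG' and 'GA': sum a binary letter kernel
# over all 2x2 letter pairs, with a 0.5 baseline for mismatches:
# (A,G) = 0.5, (A,A) = 1.0, (G,G) = 1.0, (G,A) = 0.5  ->  3.0
baseline = 0.5
s1, s2 = 'AG', 'GA'
k = sum(1.0 if c1 == c2 else baseline for c1 in s1 for c2 in s2)
print(k)  # 3.0
```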
```python
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process.kernels import Kernel, Hyperparameter
from sklearn.gaussian_process.kernels import GenericKernelMixin
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.base import clone


class SequenceKernel(GenericKernelMixin, Kernel):
    '''
    A minimal (but valid) convolutional kernel for sequences of variable
    lengths.'''
    def __init__(self,
                 baseline_similarity=0.5,
                 baseline_similarity_bounds=(1e-5, 1)):
        self.baseline_similarity = baseline_similarity
        self.baseline_similarity_bounds = baseline_similarity_bounds

    @property
    def hyperparameter_baseline_similarity(self):
        return Hyperparameter("baseline_similarity",
                              "numeric",
                              self.baseline_similarity_bounds)

    def _f(self, s1, s2):
        '''
        kernel value between a pair of sequences
        '''
        return sum([1.0 if c1 == c2 else self.baseline_similarity
                    for c1 in s1
                    for c2 in s2])

    def _g(self, s1, s2):
        '''
        kernel derivative between a pair of sequences
        '''
        return sum([0.0 if c1 == c2 else 1.0
                    for c1 in s1
                    for c2 in s2])

    def __call__(self, X, Y=None, eval_gradient=False):
        if Y is None:
            Y = X

        if eval_gradient:
            return (np.array([[self._f(x, y) for y in Y] for x in X]),
                    np.array([[[self._g(x, y)] for y in Y] for x in X]))
        else:
            return np.array([[self._f(x, y) for y in Y] for x in X])

    def diag(self, X):
        return np.array([self._f(x, x) for x in X])

    def is_stationary(self):
        return False

    def clone_with_theta(self, theta):
        cloned = clone(self)
        cloned.theta = theta
        return cloned


kernel = SequenceKernel()

'''
Sequence similarity matrix under the kernel
===========================================
'''

X = np.array(['AGCT', 'AGC', 'AACT', 'TAA', 'AAA', 'GAACA'])

K = kernel(X)
D = kernel.diag(X)

plt.figure(figsize=(8, 5))
plt.imshow(np.diag(D**-0.5).dot(K).dot(np.diag(D**-0.5)))
plt.xticks(np.arange(len(X)), X)
plt.yticks(np.arange(len(X)), X)
plt.title('Sequence similarity under the kernel')

'''
Regression
==========
'''

X = np.array(['AGCT', 'AGC', 'AACT', 'TAA', 'AAA', 'GAACA'])
Y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])

training_idx = [0, 1, 3, 4]
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X[training_idx], Y[training_idx])

plt.figure(figsize=(8, 5))
plt.bar(np.arange(len(X)), gp.predict(X), color='b', label='prediction')
plt.bar(training_idx, Y[training_idx], width=0.2, color='r',
        alpha=1, label='training')
plt.xticks(np.arange(len(X)), X)
plt.title('Regression on sequences')
plt.legend()

'''
Classification
==============
'''

X_train = np.array(['AGCT', 'CGA', 'TAAC', 'TCG', 'CTTT', 'TGCT'])
# whether there are 'A's in the sequence
Y_train = np.array([True, True, True, False, False, False])

gp = GaussianProcessClassifier(kernel)
gp.fit(X_train, Y_train)

X_test = ['AAA', 'ATAG', 'CTC', 'CT', 'C']
Y_test = [True, True, False, False, False]

plt.figure(figsize=(8, 5))
plt.scatter(np.arange(len(X_train)),
            [1.0 if c else -1.0 for c in Y_train],
            s=100, marker='o', edgecolor='none', facecolor=(1, 0.75, 0),
            label='training')
plt.scatter(len(X_train) + np.arange(len(X_test)),
            [1.0 if c else -1.0 for c in Y_test],
            s=100, marker='o', edgecolor='none', facecolor='r',
            label='truth')
plt.scatter(len(X_train) + np.arange(len(X_test)),
            [1.0 if c else -1.0 for c in gp.predict(X_test)],
            s=100, marker='x', edgecolor=(0, 1.0, 0.3), linewidth=2,
            label='prediction')
plt.xticks(np.arange(len(X_train) + len(X_test)),
           np.concatenate((X_train, X_test)))
plt.yticks([-1, 1], [False, True])
plt.title('Classification on sequences')
plt.legend()

plt.show()
```

**Total running time of the script:** (0 minutes 0.260 seconds)

[`Download Python source code: plot_gpr_on_structured_data.py`](https://scikit-learn.org/stable/_downloads/d2c3d354a93eca3b78b2436d5a8e7164/plot_gpr_on_structured_data.py)

[`Download Jupyter notebook: plot_gpr_on_structured_data.ipynb`](https://scikit-learn.org/stable/_downloads/5612f9c55259a4294f34843655f9c6af/plot_gpr_on_structured_data.ipynb)