比较随机搜索和网格搜索以进行超参数估计¶
本案例比较带有SGD训练的随机搜索和网格搜索,以优化线性SVM的超参数。 同时搜索所有影响学习的参数(考虑到训练时间与质量的平衡,我们不考虑估计量的数量)。
随机搜索和网格搜索探索的参数空间完全相同。 参数设置的结果非常相似,而随机搜索的运行时间大大减少。
对于随机搜索,性能可能会稍差一些,不过这可能是由于噪声效应导致的,并且不会延续到保留的测试集中。
请注意,实际上,不会使用网格搜索同时搜索这么多不同的参数,而只会选择那些最重要的参数。
输出:
RandomizedSearchCV took 29.15 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.920 (std: 0.028)
Parameters: {'alpha': 0.07316411520495676, 'average': False, 'l1_ratio': 0.29007760721044407}
Model with rank: 2
Mean validation score: 0.920 (std: 0.029)
Parameters: {'alpha': 0.0005223493320259539, 'average': True, 'l1_ratio': 0.7936977033574206}
Model with rank: 3
Mean validation score: 0.918 (std: 0.031)
Parameters: {'alpha': 0.00025790124268693137, 'average': True, 'l1_ratio': 0.5699649107012649}
GridSearchCV took 150.68 seconds for 100 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.931 (std: 0.026)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.0}
Model with rank: 2
Mean validation score: 0.928 (std: 0.030)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.1111111111111111}
Model with rank: 3
Mean validation score: 0.927 (std: 0.026)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.5555555555555556}
输入:
print(__doc__)
import numpy as np
from time import time
import scipy.stats as stats
from sklearn.utils.fixes import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
# 获得一些数据
X, y = load_digits(return_X_y=True)
# 建立一个分类器
clf = SGDClassifier(loss='hinge', penalty='elasticnet',
fit_intercept=True)
# 实用功能呈现最佳成绩
def report(results, n_top=3):
for i in range(1, n_top + 1):
candidates = np.flatnonzero(results['rank_test_score'] == i)
for candidate in candidates:
print("Model with rank: {0}".format(i))
print("Mean validation score: {0:.3f} (std: {1:.3f})"
.format(results['mean_test_score'][candidate],
results['std_test_score'][candidate]))
print("Parameters: {0}".format(results['params'][candidate]))
print("")
# 指定要从中采样的参数和分布
param_dist = {'average': [True, False],
'l1_ratio': stats.uniform(0, 1),
'alpha': loguniform(1e-4, 1e0)}
# 运行随机搜索
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
n_iter=n_iter_search)
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
# 对所有参数使用完整的网格搜索
param_grid = {'average': [True, False],
'l1_ratio': np.linspace(0, 1, num=10),
'alpha': np.power(10, np.arange(-4, 1, dtype=float))}
# 运行网格搜索
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
脚本的总运行时间:(2分钟59.930秒)