sklearn.datasets.fetch_rcv1

sklearn.datasets.fetch_rcv1(*, data_home=None, subset='all', download_if_missing=True, random_state=None, shuffle=False, return_X_y=False)

[source]

Load the RCV1 multilabel dataset (classification).

Download it if necessary.

Version: RCV1-v2, vectors, full sets, topics multilabels.

Classes	103
Samples total	804414
Dimensionality	47236
Features	real, between 0 and 1

Read more in the User Guide.

New in version 0.17.

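Parameter	Description
data_home	string, optional. Specify another download and cache folder for the datasets. By default, all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
subset	string, 'train', 'test', or 'all', default='all'. Select the dataset to load: 'train' for the training set (23149 samples), 'test' for the test set (781265 samples), 'all' for both, with the training samples first if shuffle is False. This follows the official LYRL2004 chronological split.
download_if_missing	boolean, default=True. If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.
random_state	int, RandomState instance, default=None. Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See the Glossary.
shuffle	bool, default=False. Whether to shuffle the dataset.
return_X_y	boolean, default=False. If True, returns (dataset.data, dataset.target) instead of a Bunch object. See below for more information about the dataset.data and dataset.target objects. New in version 0.20.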
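
As a minimal sketch of how the parameters above might be combined, the helper below wraps `fetch_rcv1` so that one subset comes back as an `(X, y)` pair. The name `load_rcv1`, its `seed` argument, and the `N_*` constants are hypothetical and exist only for this illustration. The import is kept inside the function because the first call may trigger a large download.

```python
# Sample counts quoted in the parameter table above.
N_TRAIN = 23149
N_TEST = 781265
N_ALL = 804414


def load_rcv1(subset="train", shuffle=False, seed=0, data_home=None):
    """Hypothetical helper: fetch one RCV1 subset as an (X, y) pair.

    The scikit-learn import is local, so defining the helper has no side
    effects; calling it may download the dataset into data_home.
    """
    if subset not in ("train", "test", "all"):
        raise ValueError(f"unknown subset: {subset!r}")

    from sklearn.datasets import fetch_rcv1

    return fetch_rcv1(
        data_home=data_home,
        subset=subset,
        shuffle=shuffle,
        random_state=seed,  # only matters when shuffle=True
        return_X_y=True,
    )


# The train and test splits together make up the full collection.
assert N_TRAIN + N_TEST == N_ALL
```

Calling `load_rcv1("train")` would be expected to return matrices with 23149 rows, and `load_rcv1("all")` matrices with 804414 rows, according to the counts documented above.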

Returns

dataset : Bunch
Dictionary-like object, with the following attributes.
- data: scipy csr array, dtype np.float64, shape (804414, 47236)
The array has 0.16% non-zero values.
- target: scipy csr array, dtype np.uint8, shape (804414, 103)
Each sample has a value of 1 in its categories and 0 in all others. The array has 3.15% non-zero values.
- sample_id: numpy array, dtype np.uint32, shape (804414,)
Identification number of each sample, as ordered in dataset.data.
- target_names: numpy array, dtype object, length (103)
Names of each target (RCV1 topics), as ordered in dataset.target.
- DESCR: string
Description of the RCV1 dataset.

(data, target) : tuple if return_X_y is True
New in version 0.20.
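
The sparse multilabel layout described above can be explored without downloading anything. The sketch below builds a tiny synthetic stand-in for `dataset.target` and `dataset.target_names` (4 samples, 3 topics, illustrative values only) and shows one way to compute the non-zero fraction, list the topics of a sample, and count samples per topic. The helper names `density`, `labels_of`, and `samples_per_topic` are hypothetical.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for dataset.target_names and dataset.target.
target_names = np.array(["C15", "ECAT", "GCAT"], dtype=object)
target = sparse.csr_matrix(
    np.array(
        [[1, 0, 1],
         [0, 1, 0],
         [0, 0, 1],
         [1, 1, 0]],
        dtype=np.uint8,
    )
)


def density(matrix):
    """Fraction of entries that are stored as non-zero."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


def labels_of(row, target, target_names):
    """Names of the topics whose indicator is 1 for the given sample."""
    return list(target_names[target[row].indices])


def samples_per_topic(target):
    """Number of samples assigned to each topic (column sums)."""
    return np.asarray(target.sum(axis=0)).ravel()


print(density(target))
print(labels_of(0, target, target_names))
print(samples_per_topic(target))
```

The same helpers should apply unchanged to the real `dataset.target`, where the documented non-zero fraction is about 3.15%.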
