datasets of sklearn
datasets
sklearn ships with a few small built-in toy datasets.
It can also load some external datasets.
This saves the effort of finding data yourself.
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the ‘real world’.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.
API
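There are three kinds of data-loading interfaces:
(1) Loaders for small built-in datasets: load_xxx
(2) Fetchers for larger external datasets: fetch_xxx
(3) Generators for synthetic data: make_xxx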
There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.
The dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.
The dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets section.
Both loader and fetcher functions return a sklearn.utils.Bunch object holding at least two items: an array of shape n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples, containing the target values, with key target.
The dataset generation functions. They can be used to generate controlled synthetic datasets, described in the Generated datasets section.
These functions return a tuple (X, y) consisting of an n_samples * n_features numpy array X and an array of length n_samples containing the targets y.
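The difference between the two return types can be seen in a short sketch. It assumes scikit-learn is installed; the choice of load_diabetes and the parameter values passed to make_classification are illustrative only, not taken from the notes above.

```python
from sklearn.datasets import load_diabetes, make_classification

# Loader: returns a Bunch, a dict-like object that also allows attribute access.
bunch = load_diabetes()
print(type(bunch).__name__)          # Bunch
print(bunch.data.shape)              # (442, 10)
print(bunch.target.shape)            # (442,)
print(bunch["data"] is bunch.data)   # True: the key and the attribute are the same array

# Generator: returns a plain (X, y) tuple instead of a Bunch.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
print(X.shape, y.shape)              # (100, 20) (100,)
```

Fetchers behave like loaders in this respect, but they may need network access the first time they are called.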
Loading built-in data
scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.
They can be loaded using the following functions:
load_boston(*[, return_X_y]): Load and return the boston house-prices dataset (regression). Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.
load_iris(*[, return_X_y, as_frame]): Load and return the iris dataset (classification).
load_diabetes(*[, return_X_y, as_frame]): Load and return the diabetes dataset (regression).
load_digits(*[, n_class, return_X_y, as_frame]): Load and return the digits dataset (classification).
load_linnerud(*[, return_X_y, as_frame]): Load and return the physical exercise linnerud dataset.
load_wine(*[, return_X_y, as_frame]): Load and return the wine dataset (classification).
load_breast_cancer(*[, return_X_y, as_frame]): Load and return the breast cancer wisconsin dataset (classification).
These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in scikit-learn. They are however often too small to be representative of real world machine learning tasks.
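The optional parameters in the signatures above change what a loader returns. The following is a minimal sketch, assuming scikit-learn is installed: return_X_y=True skips the Bunch and returns the arrays directly, and n_class limits load_digits to the first few digit classes. The as_frame option is left out here because it additionally needs pandas.

```python
from sklearn.datasets import load_digits, load_wine

# return_X_y=True returns (data, target) directly instead of a Bunch.
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)            # (178, 13) (178,)

# n_class=3 keeps only the samples of digits 0, 1 and 2.
digits = load_digits(n_class=3)
kept_classes = sorted(set(digits.target.tolist()))
print(kept_classes)                # [0, 1, 2]
```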
iris

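The iris (scientific name: Iris tectorum Maxim.), also known as the blue butterfly, purple butterfly or flat bamboo flower, is a perennial herb of the order Asparagales, family Iridaceae. It has a stout rhizome, blue-purple flowers, and capsules that are long-ellipsoid or obovoid.
The iris is native to central China and Japan and is found mainly in south-central China. It is grown as an ornamental; its light, delicate fragrance can be used in perfume, and its rhizome, which can be harvested all year round, is used in traditional Chinese medicine.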
Result of loading the iris data
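The loaded result can be inspected with a few lines of Python. This is a minimal sketch, assuming scikit-learn is installed; it only prints fields that the returned Bunch provides.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measured features and of the three species.
print(iris.feature_names)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']

# Overall size, then the first sample and its class index.
print(iris.data.shape)           # (150, 4)
print(iris.data[0])              # [5.1 3.5 1.4 0.2]
print(iris.target[0])            # 0
```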
Loading external data
scikit-learn provides tools to load larger datasets, downloading them if necessary.
They can be loaded using the following functions:
fetch_olivetti_faces(*[, data_home, …]): Load the Olivetti faces data-set from AT&T (classification).
fetch_20newsgroups(*[, data_home, subset, …]): Load the filenames and data from the 20 newsgroups dataset (classification).
fetch_20newsgroups_vectorized(*[, subset, …]): Load the 20 newsgroups dataset and vectorize it into token counts (classification).
fetch_lfw_people(*[, data_home, funneled, …]): Load the Labeled Faces in the Wild (LFW) people dataset (classification).
fetch_lfw_pairs(*[, subset, data_home, …]): Load the Labeled Faces in the Wild (LFW) pairs dataset (classification).
fetch_covtype(*[, data_home, …]): Load the covertype dataset (classification).
fetch_rcv1(*[, data_home, subset, …]): Load the RCV1 multilabel dataset (classification).
fetch_kddcup99(*[, subset, data_home, …]): Load the kddcup99 dataset (classification).
fetch_california_housing(*[, data_home, …]): Load the California housing dataset (regression).
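Because fetchers download data on first use, it can help to control where the files are cached and whether a download is allowed. The sketch below assumes scikit-learn is installed. The helper load_cached_california is a hypothetical name introduced here for illustration, and download_if_missing is a fetcher parameter that the abbreviated signatures above hide behind "…". The demo points data_home at an empty temporary directory, so nothing is downloaded.

```python
import tempfile

from sklearn.datasets import fetch_california_housing


def load_cached_california(data_home):
    """Return the dataset cached under data_home, or None if it is not cached.

    Hypothetical helper: download_if_missing=False forbids network access,
    so a cache miss raises OSError instead of starting a download.
    """
    try:
        return fetch_california_housing(
            data_home=data_home, download_if_missing=False
        )
    except OSError:
        return None


# A fresh temporary directory is guaranteed to hold no cached dataset.
with tempfile.TemporaryDirectory() as empty_cache:
    result = load_cached_california(empty_cache)

print(result)  # None
```

Calling the fetcher with its defaults instead downloads the data once and reuses the cached copy on later calls.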
Synthetic data
In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity.
Generators for classification and clustering
These generators produce a matrix of features and corresponding discrete targets.
Single label
Both make_blobs and make_classification create multiclass datasets by allocating each class one or more normally-distributed clusters of points. make_blobs provides greater control regarding the centers and standard deviations of each cluster, and is used to demonstrate clustering. make_classification specialises in introducing noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.
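The contrast between the two generators shows up in their parameters. Here is a minimal sketch, assuming scikit-learn is installed; all centers, standard deviations and feature counts are arbitrary illustrative values.

```python
from sklearn.datasets import make_blobs, make_classification

# make_blobs: explicit control over each cluster's center and spread.
X_blobs, y_blobs = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [0, 5], [5, 0]],   # one 2-D center per class
    cluster_std=[0.5, 1.0, 2.0],         # one standard deviation per cluster
    random_state=0,
)
print(X_blobs.shape)                     # (300, 2)

# make_classification: control over how much of the feature space is noise.
X_clf, y_clf = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,          # features that actually carry class signal
    n_redundant=2,            # linear combinations of the informative ones
    n_classes=3,
    n_clusters_per_class=2,   # several Gaussian clusters per class
    random_state=0,
)
print(X_clf.shape)                       # (300, 10)
```

The remaining five features of X_clf are uninformative noise, since only three informative and two redundant features were requested.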
make_gaussian_quantiles divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres. make_hastie_10_2 generates a similar binary, 10-dimensional problem.
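The "concentric hyperspheres" structure can be checked numerically: the mean distance from the origin should grow with the class label. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the sample sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles, make_hastie_10_2

# Three classes cut from a single 2-D Gaussian centered on the origin.
X_gq, y_gq = make_gaussian_quantiles(
    n_samples=600, n_features=2, n_classes=3, random_state=0
)

# Distance of each sample from the origin, then the mean distance per class.
radius = np.linalg.norm(X_gq, axis=1)
mean_radius = [radius[y_gq == k].mean() for k in range(3)]
print(np.bincount(y_gq))   # class sizes
print(mean_radius)         # increases with the label: nested shells

# make_hastie_10_2 always has 10 features and labels -1.0 / 1.0.
X_h, y_h = make_hastie_10_2(n_samples=500, random_state=0)
print(X_h.shape)           # (500, 10)
print(np.unique(y_h))
```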
make_circles and make_moons generate 2d binary classification datasets that are challenging to certain algorithms (e.g. centroid-based clustering or linear classification), including optional Gaussian noise. They are useful for visualisation. make_circles produces Gaussian data with a spherical decision boundary for binary classification, while make_moons produces two interleaving half circles.
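One way to see why make_circles is hard for centroid-based methods is that both classes share almost the same centroid and differ only in radius. The sketch below assumes scikit-learn and NumPy are installed; the noise and factor values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles, make_moons

# factor is the ratio of the inner circle's radius to the outer circle's.
X_c, y_c = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)
X_m, y_m = make_moons(n_samples=200, noise=0.05, random_state=0)
print(X_c.shape, X_m.shape)   # (200, 2) (200, 2)

# Class 0 is the outer circle, class 1 the inner one.
radius = np.linalg.norm(X_c, axis=1)
outer = radius[y_c == 0].mean()
inner = radius[y_c == 1].mean()
print(outer, inner)           # roughly 1.0 and 0.5

# The two class centroids nearly coincide, so they cannot separate the classes.
centroid_gap = np.linalg.norm(
    X_c[y_c == 0].mean(axis=0) - X_c[y_c == 1].mean(axis=0)
)
print(centroid_gap)
```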

