Learning scikit-learn, a machine learning library
I. Getting data
 from sklearn import datasets
	1. Datasets bundled with sklearn
		Iris
			from sklearn import datasets
			datasets.load_iris()
		Handwritten digits
			from sklearn.datasets import load_digits
			digits = load_digits()
			print(digits.data.shape)
			print(digits.target.shape)
			print(digits.images.shape)
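The loaders above return a dictionary-like object whose fields can be inspected directly. The following is a minimal sketch of that; the variable names (`iris`, `X_iris`, `images_flat`) are chosen here for illustration and do not come from the notes.

```python
from sklearn.datasets import load_iris, load_digits

# load_iris() returns a Bunch with the feature matrix, labels and metadata.
iris = load_iris()
print(iris.data.shape)      # feature matrix: one row per flower
print(iris.target_names)    # names of the classes
print(iris.feature_names)   # names of the columns

# return_X_y=True skips the Bunch and returns only (data, target).
X_iris, y_iris = load_iris(return_X_y=True)

# For the digits dataset, `data` is `images` with each 8x8 image flattened.
digits = load_digits()
images_flat = digits.images.reshape(len(digits.images), -1)
print((images_flat == digits.data).all())
```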
	2. Generating datasets
		Generating random data
			from sklearn.datasets import make_classification
			X, y = make_classification(n_samples=6, n_features=5, n_informative=2,
				n_redundant=2, n_classes=2, n_clusters_per_class=2, scale=1.0,
				random_state=20)
			
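		Generating labelled cluster data with sklearn.datasets.make_blobs
			make_blobs in scikit-learn is commonly used to generate test data for clustering algorithms. Given the number of features, the number of centers, the range the centers are drawn from, and so on, it generates several groups of points that can be used to check how well a clustering algorithm performs.
			sklearn.datasets.make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None)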
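As an illustration of the call above, here is a small sketch that generates three blobs and checks what comes back. The sample count of 300 and `random_state=0` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three Gaussian blobs in 2-D; random_state fixes the draw so it is repeatable.
X, y = make_blobs(n_samples=300, n_features=2, centers=3,
                  cluster_std=1.0, random_state=0)

print(X.shape)         # one row per sample, one column per feature
print(np.unique(y))    # one integer label per blob
print(np.bincount(y))  # samples are split evenly across the centers
```

The labels `y` are the ground truth that a clustering result can be compared against.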
		
		Generating circular / moon-shaped data with sklearn.datasets.make_circles and make_moons
			sklearn.datasets.make_circles(n_samples=100, shuffle=True, noise=None, random_state=None, factor=0.8)
			from sklearn.datasets import make_circles
			x1, y1 = make_circles(n_samples=1000, factor=0.5, noise=0.1)
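The notes name make_moons but only show a make_circles call, so the sketch below runs both side by side. The `random_state=0` argument and the radius check are additions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_circles, make_moons

# Two concentric circles; factor is the ratio of inner to outer radius.
X_c, y_c = make_circles(n_samples=1000, factor=0.5, noise=0.1, random_state=0)

# Two interleaving half-moons.
X_m, y_m = make_moons(n_samples=1000, noise=0.1, random_state=0)

# Label 0 is the outer circle, label 1 the inner one, so the mean
# distance from the origin should be larger for label 0.
radius = np.linalg.norm(X_c, axis=1)
r_outer = radius[y_c == 0].mean()
r_inner = radius[y_c == 1].mean()
print(r_outer, r_inner)
```

Both datasets are not linearly separable, which is why they are often used to compare linear and non-linear classifiers.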
II. Data preprocessing
	from sklearn import preprocessing
	Core methods: fit, transform, fit_transform
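The difference between the three methods is easiest to see on a tiny example. This sketch uses StandardScaler and two made-up arrays (`X_train`, `X_test`) that are not part of the notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # fit: learn mean and std from the training data
X_train_scaled = scaler.transform(X_train)  # transform: apply the learned statistics
X_test_scaled = scaler.transform(X_test)    # reuse the training statistics on new data

# fit_transform does fit and transform on the same data in one call.
X_train_scaled_2 = StandardScaler().fit_transform(X_train)

print(scaler.mean_)
print(X_train_scaled)
print(X_test_scaled)
```

Calling fit only on the training data and transform on the test data keeps test-set statistics out of the preprocessing step.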
	
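	Standardization	StandardScaler	rescales each feature to mean 0 and standard deviation 1
	Min-max scaling	MinMaxScaler	rescales each feature to the [0, 1] interval
	Normalization	X_normalized = preprocessing.normalize(X, norm='l2')	scales each sample (row) to unit norm
	One-hot encoding / categorical feature encoding	OneHotEncoder	works on both features and labels; labels must be passed as a 2-D array, e.g. with r a OneHotEncoder instance: r.fit_transform(np.array(a).reshape(-1,1)).toarray()
	Feature binarization	Binarizer(threshold=1.1)
	Label encoding	LabelEncoder	maps labels to integers
				LabelBinarizer	maps labels to one-hot vectors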
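The transformers listed above can be tried on a small array to see what each one returns. This is only a sketch; the matrix `X` and the `labels` list are invented for the example.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

# Min-max scaling: every column ends up in [0, 1].
X_minmax = preprocessing.MinMaxScaler().fit_transform(X)

# Normalization: every row ends up with L2 norm 1.
X_normalized = preprocessing.normalize(X, norm='l2')

# Binarization: values above the threshold become 1, the rest 0.
X_binary = preprocessing.Binarizer(threshold=1.1).fit_transform(X)

labels = ['cat', 'dog', 'bird', 'dog']

# LabelEncoder assigns integers in sorted class order (bird, cat, dog).
y_int = preprocessing.LabelEncoder().fit_transform(labels)

# LabelBinarizer goes straight from labels to one-hot rows.
y_onehot = preprocessing.LabelBinarizer().fit_transform(labels)

# OneHotEncoder needs a 2-D input and returns a sparse matrix by default.
encoder = preprocessing.OneHotEncoder()
y_onehot_2 = encoder.fit_transform(np.array(labels).reshape(-1, 1)).toarray()

print(y_int)
print(y_onehot)
```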
III. Splitting the dataset
	from sklearn.model_selection import train_test_split
	(X_train, X_test,y_train,y_test) = train_test_split(X, y, test_size=0.25, random_state=0,shuffle=True)
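To make the split concrete, the sketch below applies the same call to the iris data. Using iris and adding `stratify=y` (which keeps class proportions similar in both parts) are choices made for this example, not something the notes prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, shuffle=True, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```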
	
	k-fold cross-validation:
		from sklearn.model_selection import cross_val_score
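A possible use of cross_val_score is sketched below. The choice of LogisticRegression on iris, `cv=5` and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy over the folds
```

The estimator is cloned and refitted for every fold, so `model` itself is left unfitted after the call.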
IV. Importing models:
	Linear regression
		from sklearn.linear_model import LinearRegression
		model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)  # the normalize argument was removed in scikit-learn 1.2
	Logistic regression (LR)
		from sklearn.linear_model import LogisticRegression
	Naive Bayes (NB)
		from sklearn import naive_bayes
	Decision tree (DT)
		from sklearn.tree import DecisionTreeClassifier
	Support vector machine (SVM)
		from sklearn.svm import SVC
	k-nearest neighbors (KNN)
		from sklearn import neighbors
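All of the classifiers imported above share the same fit / predict / score interface. The sketch below trains each one on iris to show this; the dataset, the hyperparameters and the `results` dictionary are illustrative assumptions.

```python
from sklearn import naive_bayes, neighbors
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    'LR': LogisticRegression(max_iter=1000),
    'NB': naive_bayes.GaussianNB(),
    'DT': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(),
    'KNN': neighbors.KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                     # train on the training split
    results[name] = model.score(X_test, y_test)     # accuracy on the held-out split
    print(name, results[name])

# predict returns one class label per input row.
predictions = models['KNN'].predict(X_test[:3])
print(predictions)
```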
V. Model evaluation
	Validation curve
		from sklearn.model_selection import validation_curve
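validation_curve scores a model over a range of values for one hyperparameter. The following sketch varies `n_neighbors` of a KNN classifier on iris; the estimator, the dataset and the parameter range are all assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_range = [1, 3, 5, 7, 9]

# For each value of n_neighbors, run 5-fold cross-validation and
# record both the training score and the validation score.
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=param_range, cv=5)

# Rows correspond to parameter values, columns to folds.
print(train_scores.shape, test_scores.shape)
print(np.mean(train_scores, axis=1))
print(np.mean(test_scores, axis=1))
```

A large gap between the mean training score and the mean validation score for a given value suggests overfitting at that setting.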
	
	from sklearn.metrics import confusion_matrix
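A short sketch of how confusion_matrix can be read, using hand-written labels (`y_true`, `y_pred`) that are made up for this example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 0, 1]
y_pred = [0, 2, 2, 2, 0, 1]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# The diagonal counts the correct predictions, so accuracy is trace / total.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Here one sample of class 1 was predicted as class 2, which shows up as the single off-diagonal entry in row 1.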
VI. Saving the model:
	import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
	joblib.dump(model, 'model.pickle')
	model = joblib.load('model.pickle')
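A complete save-and-reload round trip might look like the sketch below. The model, the temporary directory and the file name `model.joblib` are assumptions for illustration; the standalone `joblib` package is used for the dump and load calls.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Write the fitted model to disk and read it back from a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = bool((model.predict(X) == restored.predict(X)).all())
print(same)
```

Files written this way are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.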