Scikit-learn review

Basic use of scikit-learn

Scikit-learn (sklearn) can be regarded as a repository of machine learning algorithms plus some auxiliary functions. That is not a precise description, but from a practical point of view it is very close to how the library is actually used. Typical use comes down to: importing a machine learning algorithm, setting its parameters, training, predicting, and evaluating.

Decide which machine learning algorithm to use -> import it from sklearn -> set parameters -> train -> predict -> evaluate

Here are some of the machine learning algorithms provided by sklearn, along with two useful helper functions:

# split the dataset into training set and test set
from sklearn.model_selection import train_test_split
# accuracy score
from sklearn.metrics import accuracy_score
# linear regression and logistic regression
from sklearn.linear_model import LinearRegression, LogisticRegression
# decision tree
from sklearn.tree import DecisionTreeClassifier
# random forest
from sklearn.ensemble import RandomForestClassifier
# k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
# support vector machine
from sklearn.svm import SVC

Also, sklearn provides some basic datasets that can be used for testing purposes.

from sklearn import datasets
iris = datasets.load_iris()
# other loaders (e.g. load_wine, load_digits) are listed in the scikit-learn documentation

The loaders do not return pandas DataFrames by default; they return a dictionary-like Bunch object, so to access the features and labels we use the keys 'data' and 'target'.
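
As a brief sketch of that access pattern: the object returned by load_iris supports both key access and attribute access. The as_frame=True option shown at the end is an extra not used elsewhere in this post; it assumes pandas is installed and a reasonably recent scikit-learn.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object is a Bunch: dictionary-like, with attribute access too.
X = iris["data"]      # same array as iris.data
y = iris["target"]    # same array as iris.target

print(type(iris).__name__)
print(X.shape, y.shape)
print(iris.feature_names)
print(iris.target_names)

# Optional: ask for pandas objects instead (assumes pandas is available).
iris_df = datasets.load_iris(as_frame=True).frame
print(iris_df.head())
```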

After some essential preprocessing and analysis, we can start training the model.

model = SVC(kernel='linear', C=1)
model.fit(x_train, y_train)
# After fit(), the model's internal parameters have been learned from the training data.
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
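
The fragment above assumes that x_train, x_test, y_train and y_test already exist. A minimal self-contained sketch of the same steps, assuming the Iris data with all four features and a stratified 80/20 split (both choices are illustrative, not part of the original fragment), might look like this:

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()

# Hold out 20% of the samples for testing; stratify keeps class proportions equal.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = SVC(kernel="linear", C=1)   # set hyperparameters
model.fit(x_train, y_train)         # train
y_pred = model.predict(x_test)      # predict
acc = accuracy_score(y_test, y_pred)  # evaluate
print(f"test accuracy: {acc:.3f}")
```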

Additionally, we can use grid search (GridSearchCV in sklearn.model_selection) to find the best hyperparameters for the model.
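
A parameter search of this kind is available in scikit-learn as GridSearchCV. The sketch below is only an illustration: the candidate values in param_grid, the 5-fold cross-validation, and the use of all four Iris features are assumptions chosen for the example, not settings taken from this post.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Candidate hyperparameters (illustrative values only).
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}

# Every combination is scored by 5-fold cross-validation on the training set.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
# By default the best model is refit on the whole training set, so it can score the test set.
print(f"test accuracy: {search.score(x_test, y_test):.3f}")
```

Because the search only ever sees the training set, the final test score remains an honest check of the chosen model.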

Here is an example of the whole process:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

"""Load the Iris dataset"""
iris = datasets.load_iris()
# print(iris.DESCR)
df_Setosa = pd.DataFrame(iris.data[iris.target == 0], columns=iris.feature_names)
df_Setosa["species"] = "setosa"
df_Versicolor = pd.DataFrame(iris.data[iris.target == 1], columns=iris.feature_names)
df_Versicolor["species"] = "versicolor"
df_Virginica = pd.DataFrame(iris.data[iris.target == 2], columns=iris.feature_names)
df_Virginica["species"] = "virginica"
df_iris = pd.concat([df_Setosa, df_Versicolor, df_Virginica], axis=0)
# print(df_iris.sample(5))
#     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species
# 46                6.3               2.5                5.0               1.9  virginica
# 48                5.3               3.7                1.5               0.2     setosa
# 22                7.7               2.8                6.7               2.0  virginica
# 5                 5.4               3.9                1.7               0.4     setosa
# 41                4.5               2.3                1.3               0.3     setosa

"""Analysis of the Iris dataset"""
## Check for missing values
# print(df_iris.info())
# no NaN values, quite a clean dataset
## Check for outliers
# print(df_iris.describe())
# no obvious outliers in the summary statistics
# -------------------------------------------------------------
# sns.pairplot(df_iris, hue="species")
# plt.show()
# -------------------------------------------------------------
# some strong correlations in the pooled data (all 3 species together) between:
corr = df_iris.drop(columns="species").corr()
# print(corr)
# petal length vs petal width
# corr = 0.962865
# petal length vs sepal length
# corr = 0.871754
# -------------------------------------------------------------
## petal length, petal width and sepal length
# setosa has the shortest petals and sepals on average
# virginica has the longest
# while versicolor sits in the middle
## sepal width
# setosa has the widest sepals on average
# while the other two species have similar sepal widths

"""Split the dataset (for regression) into training and testing sets"""
x = df_iris["petal width (cm)"].to_numpy().reshape(-1, 1)
y = df_iris["petal length (cm)"].to_numpy().reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

"""Regression with Linear Regression"""
reg_model = LinearRegression()
reg_model.fit(x_train, y_train)
y_pred = reg_model.predict(x_test)
r2 = reg_model.score(x_test, y_test)  # for regressors, score() returns R^2, not accuracy
# print(f"Linear Regression R^2: {r2:.4f}")
# 0.9286, not bad

"""Split the dataset (for classification) into training and testing sets"""
encode_species = {
    "setosa": 0,
    "versicolor": 1,
    "virginica": 2
}
x = df_iris["petal width (cm)"].to_numpy().reshape(-1, 1)
y = df_iris["species"].map(encode_species).to_numpy()  # species name -> integer label
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
# sns.boxplot(x=y_train.flatten(), y=x_train.flatten())
# plt.ylabel("Petal Width (cm)")
# plt.xlabel("Species")
# plt.show()

"""Classification with Logistic Regression"""
clf_model = LogisticRegression()
clf_model.fit(x_train, y_train)
y_pred = clf_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
# print(f"Logistic Regression Accuracy: {(100 * accuracy):.2f}%")
# 100.00%, perfect classification

"""Classification with Support Vector Machines"""
clf_model = SVC(kernel='linear', C=1.0)
clf_model.fit(x_train, y_train)
y_pred = clf_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {(100 * accuracy):.2f}%")
# 100.00%, perfect classification

"""Classification with K-Nearest Neighbors"""
n_grid = np.arange(3, 11)
accuracy_lst = []
for n in n_grid:
    clf_model = KNeighborsClassifier(n_neighbors=n)
    clf_model.fit(x_train, y_train)
    y_pred = clf_model.predict(x_test)
    accuracy_lst.append(accuracy_score(y_test, y_pred))
best_n = n_grid[np.argmax(accuracy_lst)]  # first n_neighbors with the highest test accuracy
# print(f"KNN Accuracy (n_neighbors={best_n}): {(100 * max(accuracy_lst)):.2f}%")
# 100.00%, perfect classification

"""Classification with Decision Trees"""
clf_model = DecisionTreeClassifier(max_depth=3)
clf_model.fit(x_train, y_train)
y_pred = clf_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {(100 * accuracy):.2f}%")
# 100.00%, perfect classification
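
One caveat about the KNN step in the script: it picks n_neighbors by comparing accuracy on the test set, so the test set is no longer an independent check. A possible alternative, sketched here under the same assumptions as the script (petal width as the only feature, the same split), is to choose n_neighbors by cross-validation on the training set alone. The 5-fold setting and the names cv_scores, best_n and test_acc are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:, [3]]   # petal width (cm), kept as a 2-D column
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate n_neighbors.
n_grid = np.arange(3, 11)
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=int(n)), x_train, y_train, cv=5).mean()
    for n in n_grid
]
best_n = int(n_grid[int(np.argmax(cv_scores))])

# Refit with the chosen value and touch the test set only once, at the end.
final_model = KNeighborsClassifier(n_neighbors=best_n).fit(x_train, y_train)
test_acc = final_model.score(x_test, y_test)
print(f"best n_neighbors: {best_n}, test accuracy: {test_acc:.3f}")
```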

Remember to write down your whole plan as comments before you start coding; it is definitely good practice.

posted @ 2025-07-06 18:26  宣明晢