
In machine learning, data is often split into three parts: training, validation, and testing. This split ensures that the model learns effectively and generalizes well to new, unseen data. Here's what each part is used for:

 

1. Training Data

Purpose:
Used to train the model—that is, to adjust the model’s weights/parameters to minimize the error on this data.

Process:

  • The model sees this data repeatedly.

  • Optimization algorithms (like gradient descent) use this data to minimize a loss function.

Analogy:
Like studying for a test using a textbook—you learn from it directly.
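
To make "adjust the model's weights to minimize the error" concrete, here is a minimal sketch of gradient-descent training for a one-variable linear model on synthetic data (all names and values below are illustrative, not from a specific library example):

import numpy as np

# Synthetic training data: y = 3x + 1 plus noise (illustrative)
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=100)
y_train = 3 * X_train + 1 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # model parameters, adjusted during training
lr = 0.1          # learning rate (a hyperparameter)

for epoch in range(200):                     # the model sees this data repeatedly
    y_pred = w * X_train + b
    error = y_pred - y_train
    loss = (error ** 2).mean()               # mean squared error loss
    grad_w = 2 * (error * X_train).mean()    # gradient of the loss w.r.t. w
    grad_b = 2 * error.mean()                # gradient of the loss w.r.t. b
    w -= lr * grad_w                         # gradient-descent update
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final training loss={loss:.4f}")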


2. Validation Data

Purpose:
Used to tune hyperparameters (like learning rate, regularization strength, number of layers, etc.) and monitor model performance during training to prevent overfitting.

Process:

  • Not used to train the model.

  • Helps decide when to stop training (early stopping).

  • Used for model selection (e.g., choosing the best performing model version).

Analogy:
Like taking practice tests to see how well you're doing and adjusting your study strategy.
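
As an illustration of how the validation set guides training without being learned from, here is a sketch of early stopping using scikit-learn's SGDClassifier (the patience value and epoch count are arbitrary choices for this sketch; loss="log_loss" is the name used in scikit-learn 1.1+):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

model = SGDClassifier(loss="log_loss", random_state=42)
classes = np.unique(y_train)

best_val_loss, patience, bad_epochs = np.inf, 5, 0
for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=classes)  # learn from training data only
    # Monitor on validation data; the model never trains on it
    val_loss = log_loss(y_val, model.predict_proba(X_val), labels=classes)
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # early stopping
            print(f"stopping at epoch {epoch}; best validation loss {best_val_loss:.3f}")
            break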


3. Test Data

Purpose:
Used to evaluate the final model's performance. It simulates how the model will behave in the real world, on completely unseen data.

Process:

  • Used once after training and validation are complete.

  • Gives an unbiased estimate of the model’s generalization ability.

Analogy:
Like the final exam—your performance here reflects your true learning.
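
The key discipline: the test set is held out from the start and touched exactly once, after every training and tuning decision is frozen. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out the test set first and do not look at it again until the very end
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

final_model = LogisticRegression(max_iter=200).fit(X_trainval, y_trainval)
# One evaluation, once, after all modeling decisions are final
print("Test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))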


Summary Table:

| Data Type  | Used For                         | Seen During Training?    | Affects Model Weights? |
|------------|----------------------------------|--------------------------|------------------------|
| Training   | Learn model parameters           | ✅ Yes                   | ✅ Yes                 |
| Validation | Tune hyperparameters, early stop | ✅ Yes (but no learning) | ❌ No                  |
| Test       | Final performance evaluation     | ❌ No                    | ❌ No                  |
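
The end-to-end scikit-learn example below ties the three splits together on the Iris dataset: a 60/20/20 train/validation/test split, a hyperparameter grid search, evaluation on the validation set, and a single final evaluation on the test set.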

 

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data
X, y = load_iris(return_X_y=True)

# 2. Split into temp (80%) and test (20%)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Further split temp into training (60%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# 4. Define model and hyperparameter grid
model = LogisticRegression(max_iter=200)
param_grid = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}

# 5. Search hyperparameters with GridSearchCV
#    Note: GridSearchCV runs its own 3-fold cross-validation inside X_train;
#    the explicit validation set is used separately in step 6
grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X_train, y_train)

print("Best parameters from grid search:", grid_search.best_params_)

# 6. Evaluate best model on validation set
val_pred = grid_search.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print("Validation Accuracy:", val_accuracy)

# 7. Final, one-time evaluation on the held-out test set (simulates unseen data)
final_model = grid_search.best_estimator_
test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
print("Test Accuracy:", test_accuracy)
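
One caveat about step 5 above: GridSearchCV performs its own 3-fold cross-validation inside X_train, so the explicit validation set only enters through the separate check in step 6. If you want the search itself to score every candidate on that exact validation split, one option is scikit-learn's PredefinedSplit; this sketch reuses model, param_grid, and the split variables from the example above:

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 marks rows never used for validation; 0 marks rows of the single validation fold
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val))])
X_search = np.concatenate([X_train, X_val])
y_search = np.concatenate([y_train, y_val])

search = GridSearchCV(model, param_grid, cv=PredefinedSplit(test_fold))
search.fit(X_search, y_search)  # each candidate trains on X_train and is scored on X_val
print("Best parameters scored on the explicit validation set:", search.best_params_)

With the default refit=True, the winning configuration is then refit on training plus validation data together, which is a common last step before the one-time test evaluation.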

 

posted on 2025-07-02 22:49 by ZhangZhihuiAAA