In machine learning, data is often split into three parts—training, validation, and testing—to ensure that the model learns effectively and generalizes well to new, unseen data. Here's what each part is used for:
1. Training Data
Purpose:
Used to train the model—that is, to adjust the model’s weights/parameters to minimize the error on this data.
Process:
- The model sees this data repeatedly.
- Optimization algorithms (like gradient descent) use this data to minimize a loss function (see the sketch below).
Analogy:
Like studying for a test using a textbook—you learn from it directly.
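To make this concrete, here is a minimal sketch (plain NumPy with synthetic data; the learning rate and epoch count are arbitrary illustrative choices) of what "training" means: gradient descent repeatedly adjusts a linear model's weights to shrink the loss measured on the training data only.

```python
# Minimal sketch: gradient descent on synthetic training data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))               # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                                   # model parameters to learn
lr = 0.1                                          # learning rate (a hyperparameter)
for epoch in range(200):                          # the model "sees" the data repeatedly
    pred = X_train @ w
    grad = 2 * X_train.T @ (pred - y_train) / len(y_train)  # gradient of MSE loss
    w -= lr * grad                                # update step: weights change here

print("learned weights:", w.round(2))             # close to true_w
```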
2. Validation Data
Purpose:
Used to tune hyperparameters (like learning rate, regularization strength, number of layers, etc.) and monitor model performance during training to prevent overfitting.
Process:
- Not used to train the model.
- Helps decide when to stop training (early stopping; see the sketch below).
- Used for model selection (e.g., choosing the best-performing model version).
Analogy:
Like taking practice tests to see how well you're doing and adjusting your study strategy.
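As an illustration, here is a minimal early-stopping sketch (scikit-learn with synthetic data; the `patience` threshold and epoch limit are arbitrary choices for the example). The model is fit on the training set, scored on the validation set after each pass, and training halts once validation accuracy stops improving:

```python
# Minimal early-stopping sketch: the validation set is scored, never fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SGDClassifier(random_state=0)
best_score, patience, stale = 0.0, 3, 0
for epoch in range(50):
    clf.partial_fit(X_train, y_train, classes=np.unique(y))  # one training pass
    score = clf.score(X_val, y_val)               # monitor on validation data
    if score > best_score:
        best_score, stale = score, 0
    else:
        stale += 1
    if stale >= patience:                         # no improvement for 3 epochs
        print(f"early stop at epoch {epoch}, val accuracy {best_score:.3f}")
        break
```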
3. Test Data
Purpose:
Used to evaluate the final model's performance. It simulates how the model will behave in the real world, on completely unseen data.
Process:
- Used once, after training and validation are complete.
- Gives an unbiased estimate of the model's generalization ability.
Analogy:
Like the final exam—your performance here reflects your true learning.
Summary Table:
| Data Type | Used For | Seen During Training? | Affects Model Weights? |
|---|---|---|---|
| Training | Learn model parameters | ✅ Yes | ✅ Yes |
| Validation | Tune hyperparameters, early stopping | ✅ Yes (monitored, not learned from) | ❌ No |
| Test | Final performance evaluation | ❌ No | ❌ No |
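The end-to-end example below ties the three splits together using scikit-learn and the Iris dataset: the test set is carved out first, a validation set is split from the remainder, hyperparameters are tuned on the training data, and the test set is evaluated only once at the end.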
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data
X, y = load_iris(return_X_y=True)

# 2. Split into temp (80%) and test (20%)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Further split temp into training (60%) and validation (20%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# 4. Define model and hyperparameter grid
model = LogisticRegression(max_iter=200)
param_grid = {
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2']
}

# 5. Tune hyperparameters with GridSearchCV (3-fold CV within the training set)
grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X_train, y_train)
print("Best parameters from grid search:", grid_search.best_params_)

# 6. Evaluate the chosen model on the held-out validation set
val_pred = grid_search.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print("Validation Accuracy:", val_accuracy)

# 7. Final test (simulate unseen data) -- touched exactly once
final_model = grid_search.best_estimator_
test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
print("Test Accuracy:", test_accuracy)
```
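One caveat worth noting: GridSearchCV performs its own 3-fold cross-validation inside the training set, so the explicit validation split above acts as an independent sanity check rather than the tuning set itself. If you want the search to use exactly your held-out validation set, scikit-learn's PredefinedSplit can be passed as the `cv` argument.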
