【R语言学习笔记】7. 将数据划分为训练集、验证集和测试集

1. 目的：介绍将数据集划分为训练集、验证集和测试集的方法。

2. 数据来源：github https://github.com/reisanar/datasets/blob/master/WestRoxbury.csv

3. 此博客主要介绍划分数据的方法，因此不对变量做过多介绍。

4. 划分方法

4.1 将变量划分为训练集、验证集和测试集

Method 1：

## partitioning into training (50%), validation (30%), and test (20%) sets

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# sample 30% of the row IDs into the validation set, drawing only from records not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3)

# assign the remaining 20% row IDs serve as test
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))

# create the 3 data frames by collecting all columns from the appropriate rows 
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[valid.rows, ]
housing.test <- housing.df[test.rows, ]

Method 2:

## alternative

train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.5)
housing.train <- housing.df[train.rows,]

remain <- housing.df[-train.rows,]
valid.rows <- sample(1:nrow(remain), dim(housing.df)[1]*0.3) #dim(housing.df)[1]*0.3 -> be careful!
housing.valid <- remain[valid.rows,]
housing.test <- remain[-valid.rows,]

4.2 将数据划分为训练集和测试集

Method 1:

## partitioning into training (60%) and validation (40%)
set.seed(1) ## to get the same sequence of numbers

# randomly sample 60% of the row IDs for training; the remaining 40% serve as validation
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.6)

# collect all the columns with training row ID into training set:
housing.train <- housing.df[train.rows, ]

# assign row IDs that are not already in the training set, into validation 
valid.rows <- setdiff(rownames(housing.df), train.rows) 
housing.valid <- housing.df[valid.rows, ]

Method 2:

## alternative 1
# collect all the columns without training row ID into validation set 
train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.6)
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[-train.rows, ]

Method 3:

## alternative 2: generate random numbers
gp <- runif(nrow(housing.df)) # generate uniform random numbers
housing.train <- housing.df[gp < 0.6,]
housing.test <- housing.df[gp >=0.6,]

Method 4:

## alternative 3
n_obs <- nrow(housing.df) # get the number of observations
permuted_rows <- sample(n_obs) # shuffle row indices: permuted_rows
housing_shuffled <- housing.df[permuted_rows,] # randomly order data: Sonar
split <- round(n_obs * 0.6) # identify row to split on: split
housing.train <- housing_shuffled[1: split,] # create train
housing.test <- housing_shuffled[(split+1):nrow(housing_shuffled),] # create test

posted on 2020-01-29 00:56 shanshant 阅读(13575) 评论(0) 编辑收藏举报