Decision Tree in R (card.csv)
Credit Card Purchases
Data Set
- Data Download Link: card.csv
- Number of records: 9999
- 9 variables in the data set, of which 6 are used for decision tree generation
- 1 output variable: Level
- 5 input variables: Gender, Age, MaritalStatus, OccupationCategory, TotalTransactions

Step 1: Read Data
# Read the data and keep columns 1-9
# file.choose() opens an interactive dialog to locate card.csv
file.choose()
set.rc<-read.csv("card.csv",header=T)
rc<-set.rc[,1:9]
Step 2: Data Exploration
dim(rc)
# dimensions of the object, i.e. number of records and number of variables
str(rc)
# internal structure of the object, i.e. the type of each variable
attributes(rc)
# attributes of the object, i.e. names and class
head(rc)
# show the first few rows of the data
tail(rc)
# show the last few rows of the data
rc[1:10,]
# show rows 1 to 10 of the data
rc[1:10,"Gender"]
# show the first 10 values of the variable "Gender"
Step 3: Individual Variable Exploration
# summary statistics, frequency table, pie chart, plot
summary(rc)
table(rc$Level)
pie(table(rc$Level))
plot(rc$Level)


Step 4: Train and Test Data Set Generation
# Randomly split the data set into traindata and testdata in a 7:3 ratio
set.seed(1234)
ind<-sample(2,nrow(rc),replace=TRUE,prob=c(0.7,0.3))
traindata <- rc[ind==1,]
testdata <- rc[ind==2,]
nrow(traindata)
nrow(testdata)
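The split above works by drawing, for each row, either 1 or 2 with probabilities 0.7 and 0.3, then using that vector as a row filter. A minimal standalone sketch of the same idea, using a made-up toy data frame instead of card.csv:

```r
# Toy illustration (not the real card.csv data): each row is assigned
# to group 1 (train) with probability 0.7 or group 2 (test) with 0.3
set.seed(1234)
toy <- data.frame(x = 1:1000)
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train <- toy[ind == 1, , drop = FALSE]
test  <- toy[ind == 2, , drop = FALSE]
# Every row lands in exactly one of the two sets
nrow(train) + nrow(test)  # equals nrow(toy)
nrow(train) / nrow(toy)   # close to 0.7, not exactly 0.7
```

Note that the proportions are only approximate: the split is per-row Bernoulli sampling, not an exact 70/30 partition.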
Step 5: Decision Tree Generation
# Load the party package, build the formula, fit a decision tree (ctree) on the training set, predict on the test set, and tabulate predicted vs. actual values
library(party)
myformula<-Level~Gender+Age+MaritalStatus+OccupationCategory+TotalTransactions
train.tree<-ctree(myformula,data=traindata)
test.pred<-predict(train.tree, newdata=testdata)
table(test.pred,testdata$Level)
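The table above is a confusion matrix: rows are predicted classes, columns are actual classes, and the diagonal holds the correct predictions. A small sketch of how overall accuracy can be read off such a table, using invented "Gold"/"Silver" levels rather than the real Level values from card.csv:

```r
# Hypothetical predicted and actual class labels (made up for illustration)
pred   <- factor(c("Gold", "Gold", "Silver", "Silver", "Gold"))
actual <- factor(c("Gold", "Silver", "Silver", "Silver", "Gold"))
cm <- table(pred, actual)          # confusion matrix: predicted vs. actual
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
accuracy  # -> 0.8
```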
Step 6: Printing the Rules and Plotting the Decision Tree before Pruning
# Print the terminal-node results and plot the tree
print(train.tree)
plot(train.tree)
plot(train.tree,type="simple")
Figure 1. Decision Tree before Pruning

Step 7: Pruning
library(rpart)
card.rpart <- rpart(myformula,data = traindata,control = rpart.control(minsplit = 10))
# Print the cp (complexity parameter) table, used to judge how far the tree should be pruned
print(card.rpart$cptable)
rel error is estimated on the training data; xerror (and xstd) are estimated on held-out data.
rel error decreases as the tree grows, because the tree becomes more and more adjusted to the training data. This apparently better performance should not be taken as "real" when predicting on a new sample, because larger trees tend to overfit the training sample and will hardly generalise well on fresh data.
That is the motivation for the xerror (and xstd) estimates. These are more realistic estimates of the performance of the tree on new samples of data, and the rpart function obtains them through an internal cross-validation process.
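The cp selection that follows can be illustrated on a made-up cptable (all numbers below are invented for the example, not taken from card.csv):

```r
# Hypothetical cptable; real values come from card.rpart$cptable.
# Note xerror dips at row 2 and rises again: the classic overfitting pattern.
cpt <- matrix(c(0.20, 0.05, 0.01,    # CP
                1.00, 0.80, 0.75,    # rel error (always decreasing)
                1.02, 0.85, 0.90,    # xerror (cross-validated)
                0.03, 0.03, 0.03),   # xstd
              ncol = 4,
              dimnames = list(NULL, c("CP", "rel error", "xerror", "xstd")))
opt <- which.min(cpt[, "xerror"])  # row with the smallest cross-validated error
cpt[opt, "CP"]                     # -> 0.05: prune with this cp, not the smallest one
```

Even though rel error keeps falling, xerror picks the middle row, so the pruned tree stops there.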
# From the cptable, take the cp value corresponding to the smallest xerror, and use it to set the pruning strength
opt <- which.min(card.rpart$cptable[,"xerror"])
cp <-card.rpart$cptable[opt,"CP"]
card.prune <- prune(card.rpart, cp = cp)
Step 8: Printing the Rules and Plotting the Decision Tree after Pruning
# Print the terminal-node results and plot the pruned tree
print(card.prune)
plot(card.prune)
text(card.prune,use.n=TRUE)

Figure 2. Decision Tree after Pruning

