Machine Learning: Pre-processing features

from:http://analyticsbot.ml/2016/10/machine-learning-pre-processing-features/


I am participating in this Kaggle competition, a prediction contest. The problem statement is:

How severe is an insurance claim?

When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

You can take a look at the data here. You can easily open the dataset in Excel and look at the variables/features. There are 116 categorical variables and 14 continuous variables in the dataset. Let’s start the analysis.

Import all necessary modules.
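A minimal sketch of the imports used in the rest of the walkthrough; the exact list in the original post may have differed slightly:

```python
from __future__ import print_function  # the post targets Python 2.7

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
```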

All these modules should be installed on your machine. I am using Python 2.7.11. If you need to install any of them, you can simply do:
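```
pip install numpy pandas matplotlib scikit-learn
```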

Let’s read the datasets using pandas
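Assuming train.csv and test.csv from the competition’s Data page sit in the working directory:

```python
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```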

Let’s take a look at the data
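Something like the following, printing the head twice: once with pandas’ default column limit and once after lifting it (the display option is my assumption about how the original produced both outputs):

```python
print(train.head(5))  # pandas truncates the 130 columns by default

pd.set_option('display.max_columns', None)  # show every column
print(train.head(5))
```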


You might have noticed that we printed the same thing twice. The first time pandas shows only a handful of columns with the first five observations, but the second time it shows all the columns. This is because of the display.max_columns option: pandas truncates wide frames by default, and setting the option to None lifts that limit.

Make sure you call head() on the DataFrame; if you print it without head(), pandas will dump every row on screen, which will not be pretty. Take a look at the columns present in the train and test sets.
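For example:

```python
print(train.columns)
print(test.columns)
```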

There is an id column in both data sets which we don’t need for any analysis, so we drop it. We will also move the loss column from the training set into a separate variable.
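A sketch (the Allstate files name these columns id and loss):

```python
loss = train['loss']                        # keep the target aside
train = train.drop(['id', 'loss'], axis=1)  # drop identifier and target
test = test.drop('id', axis=1)
```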

Let’s take a look at the continuous variables and their basic statistical analysis.
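describe() covers the numeric columns, which here are exactly the 14 continuous features:

```python
print(train.describe())
```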


In many competitions, you’ll find there are some features that might be present in the training set but not in the test set and vice-versa.

In this case, we see that, apart from the loss column we set aside, the train and test sets have exactly the same columns.
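A quick way to verify this is to compare the column sets in both directions; both prints should come back empty:

```python
print(set(train.columns) - set(test.columns))  # expect set()
print(set(test.columns) - set(train.columns))  # expect set()
```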

Let’s identify the categorical and continuous variables. For this data set, there are two ways to find them (a sketch of both follows the list):

  • the variable names contain ‘cat’ or ‘cont’, which tells us their type
  • pandas stores the categorical columns with data type object
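```python
# method 1: rely on the naming convention of this data set
cat_cols = [c for c in train.columns if 'cat' in c]
cont_cols = [c for c in train.columns if 'cont' in c]

# method 2: rely on the dtype pandas assigned while reading the file
cat_cols_alt = train.select_dtypes(include=['object']).columns.tolist()

print('%d categorical, %d continuous' % (len(cat_cols), len(cont_cols)))
```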


Correlation between continuous variables

Let’s take a look at the correlation between the continuous variables. The idea is to remove variables that are highly correlated with others, since they add little new information.
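A sketch that computes the pairwise Pearson correlations of the continuous features and prints the strongly correlated pairs; the 0.7 cutoff is my choice, not necessarily the post’s:

```python
corr = train[cont_cols].corr()

threshold = 0.7  # assumed cutoff for "highly correlated"
for i in range(len(cont_cols)):
    for j in range(i + 1, len(cont_cols)):
        if abs(corr.iloc[i, j]) > threshold:
            print('%s %s %.3f' % (cont_cols[i], cont_cols[j], corr.iloc[i, j]))
```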


Let’s take a look at the labels present in the categorical variables. Although the two sets have the same columns, some labels might be present in one data set but not in the other.
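One way to check is to flag every categorical column whose label set differs between the two files:

```python
for col in cat_cols:
    train_labels = set(train[col].unique())
    test_labels = set(test[col].unique())
    if train_labels != test_labels:
        # labels seen in only one of the two data sets
        print('%s: %s' % (col, train_labels ^ test_labels))
```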


Let’s plot the categorical variables to see how their labels are distributed.
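A simple matplotlib sketch, drawing one bar chart of label frequencies per variable:

```python
for col in cat_cols:
    train[col].value_counts().plot(kind='bar', title=col)
    plt.show()
```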

[Bar plots shown for: cat21 cat22 cat23 cat24 cat29 cat32 cat34 cat35 cat43 cat50 cat54 cat55 cat56 cat72 cat75 cat76 cat77 cat79 cat83 cat86 cat89 cat90 cat91 cat93 cat94 cat96 cat97 cat100 cat101 cat104 cat107 cat109 cat110 cat112 cat116]

One Hot Encoding of categorical variables

Encode categorical integer features using a one-hot aka one-of-K scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.

1. The first way is to use scikit-learn’s DictVectorizer to encode the labels in each feature.
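A minimal sketch: when the rows are passed as dicts, DictVectorizer one-hot encodes the string-valued features and passes the numeric ones through unchanged:

```python
dv = DictVectorizer(sparse=True)
# each row becomes a {column: value} dict; string values are one-hot encoded
train_ohe = dv.fit_transform(train.to_dict(orient='records'))
print(train_ohe.shape)  # one column per (feature, label) pair plus the continuous columns
```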


2. The second method is to use pandas get_dummies to create dummy variables.
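For example:

```python
# expand each categorical column into one indicator column per label
train_dummies = pd.get_dummies(train, columns=cat_cols)
print(train_dummies.shape)
```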

3. Some of these variables have only two labels and some have more than two. One way is to use pandas factorize to convert the labels to numeric codes.
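A sketch using pandas factorize, which maps each label to an integer code:

```python
train_fact = train.copy()
for col in cat_cols:
    # sort=True makes the codes follow the alphabetical order of the labels
    train_fact[col] = pd.factorize(train_fact[col], sort=True)[0]
```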

4. Another way is to mix the two approaches: factorize the binary variables and create dummy variables for the rest.
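One way to do the mix; splitting on nunique() is my interpretation of the idea:

```python
binary_cols = [c for c in cat_cols if train[c].nunique() == 2]
multi_cols = [c for c in cat_cols if train[c].nunique() > 2]

train_mixed = train.copy()
for col in binary_cols:
    train_mixed[col] = pd.factorize(train_mixed[col], sort=True)[0]
train_mixed = pd.get_dummies(train_mixed, columns=multi_cols)
```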

Here’s the full code.
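A condensed sketch stringing the steps above together; the file names, the choice of get_dummies, and encoding train and test jointly so their dummy columns line up are assumptions on my part:

```python
from __future__ import print_function  # Python 2.7
import pandas as pd

# load the competition files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# set the target aside and drop the identifier
loss = train['loss']
train = train.drop(['id', 'loss'], axis=1)
test = test.drop('id', axis=1)

# split feature types by naming convention
cat_cols = [c for c in train.columns if 'cat' in c]
cont_cols = [c for c in train.columns if 'cont' in c]

# correlations among the continuous features
corr = train[cont_cols].corr()

# report labels that appear in only one of the two data sets
for col in cat_cols:
    diff = set(train[col].unique()) ^ set(test[col].unique())
    if diff:
        print('%s: %s' % (col, diff))

# one-hot encode on the combined frame so train and test
# end up with identical dummy columns
combined = pd.concat([train, test], ignore_index=True)
combined = pd.get_dummies(combined, columns=cat_cols)
train_encoded = combined.iloc[:len(loss)]
test_encoded = combined.iloc[len(loss):]
print(train_encoded.shape, test_encoded.shape)
```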

Happy Python-ing!
