Machine Learning: Pre-processing Features
Source: http://analyticsbot.ml/2016/10/machine-learning-pre-processing-features/
I am participating in this Kaggle competition, a prediction contest. The problem statement is:
How severe is an insurance claim?
When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.
Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.
You can take a look at the data here. You can easily open the dataset in Excel and look at the variables/features. There are 116 categorical variables and 14 continuous variables in the dataset. Let's start the analysis.
Import all necessary modules.
```python
# import required libraries
# pandas for reading data and manipulation
# scikit-learn for one-hot encoding and label encoding
# seaborn and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator
```
All these modules should be installed on your machine. I am using Python 2.7.11. If you have to install these modules, you can simply do
```
pip install <module name>

Example:
pip install pandas
```
Let’s read the datasets using pandas
```python
# read data from the csv files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```
Let’s take a look at the data
```python
# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)

print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)

# the above code won't print all columns;
# to print all columns:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)

print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
```
```
**************************************
TRAIN DATA
**************************************
   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9   ...        cont6  \
0   1    A    B    A    B    A    A    A    A    B   ...     0.718367
1   2    A    B    A    A    A    A    A    A    B   ...     0.438917
2   5    A    B    A    A    B    A    A    A    B   ...     0.289648
3  10    B    B    A    B    A    A    A    A    B   ...     0.440945
4  11    A    B    A    B    A    A    A    A    B   ...     0.178193

      cont7    cont8    cont9   cont10    cont11    cont12    cont13  \
0  0.335060  0.30260  0.67135  0.83510  0.569745  0.594646  0.822493
1  0.436585  0.60087  0.35127  0.43919  0.338312  0.366307  0.611431
2  0.315545  0.27320  0.26076  0.32446  0.381398  0.373424  0.195709
3  0.391128  0.31796  0.32128  0.44467  0.327915  0.321570  0.605077
4  0.247408  0.24564  0.22089  0.21230  0.204687  0.202213  0.246011

     cont14     loss
0  0.714843  2213.18
1  0.304496  1283.60
2  0.774425  3005.09
3  0.602642   939.85
4  0.432606  2763.85

[5 rows x 132 columns]
**************************************
TEST DATA
**************************************
   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9    ...        cont5  \
0   4    A    B    A    A    A    A    A    A    B    ...     0.281143
1   6    A    B    A    B    A    A    A    A    B    ...     0.836443
2   9    A    B    A    B    B    A    B    A    B    ...     0.718531
3  12    A    A    A    A    B    A    A    A    A    ...     0.397069
4  15    B    A    A    A    A    B    A    A    A    ...     0.302678

      cont6     cont7    cont8    cont9   cont10    cont11    cont12  \
0  0.466591  0.317681  0.61229  0.34365  0.38016  0.377724  0.369858
1  0.482425  0.443760  0.71330  0.51890  0.60401  0.689039  0.675759
2  0.212308  0.325779  0.29758  0.34365  0.30529  0.245410  0.241676
3  0.369930  0.342355  0.40028  0.33237  0.31480  0.348867  0.341872
4  0.398862  0.391833  0.23688  0.43731  0.50556  0.359572  0.352251

     cont13    cont14
0  0.704052  0.392562
1  0.453468  0.208045
2  0.258586  0.297232
3  0.592264  0.555955
4  0.301535  0.825823

[5 rows x 131 columns]
```
```
**************************************
TRAIN DATA
**************************************
   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13  \
0   1    A    B    A    B    A    A    A    A    B     A     B     A     A
1   2    A    B    A    A    A    A    A    A    B     B     A     A     A
2   5    A    B    A    A    B    A    A    A    B     B     B     B     B
3  10    B    B    A    B    A    A    A    A    B     A     A     A     A
4  11    A    B    A    B    A    A    A    A    B     B     A     B     A

  cat14 cat15 cat16 cat17 cat18 cat19 cat20 cat21 cat22 cat23 cat24 cat25  \
0     A     A     A     A     A     A     A     A     A     B     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     A     A
3     A     A     A     A     A     A     A     A     A     B     A     A
4     A     A     A     A     A     A     A     A     A     B     A     A

  cat26 cat27 cat28 cat29 cat30 cat31 cat32 cat33 cat34 cat35 cat36 cat37  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     B     A
3     A     A     A     A     A     A     A     A     A     A     A     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat38 cat39 cat40 cat41 cat42 cat43 cat44 cat45 cat46 cat47 cat48 cat49  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     A     A
3     A     A     A     A     A     A     A     A     A     A     A     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat50 cat51 cat52 cat53 cat54 cat55 cat56 cat57 cat58 cat59 cat60 cat61  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     A     A
3     A     A     A     A     A     A     A     A     A     A     A     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat62 cat63 cat64 cat65 cat66 cat67 cat68 cat69 cat70 cat71 cat72 cat73  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     A     A
3     A     A     A     A     A     A     A     A     A     A     A     B
4     A     A     A     A     A     A     A     A     A     A     B     A

  cat74 cat75 cat76 cat77 cat78 cat79 cat80 cat81 cat82 cat83 cat84 cat85  \
0     A     B     A     D     B     B     D     D     B     D     C     B
1     A     A     A     D     B     B     D     D     A     B     C     B
2     A     A     A     D     B     B     B     D     B     D     C     B
3     A     A     A     D     B     B     D     D     D     B     C     B
4     A     A     A     D     B     D     B     D     B     B     C     B

  cat86 cat87 cat88 cat89 cat90 cat91 cat92 cat93 cat94 cat95 cat96 cat97  \
0     D     B     A     A     A     A     A     D     B     C     E     A
1     D     B     A     A     A     A     A     D     D     C     E     E
2     B     B     A     A     A     A     A     D     D     C     E     E
3     D     B     A     A     A     A     A     D     D     C     E     E
4     B     C     A     A     A     B     H     D     B     D     E     E

  cat98 cat99 cat100 cat101 cat102 cat103 cat104 cat105 cat106 cat107 cat108  \
0     C     T      B      G      A      A      I      E      G      J      G
1     D     T      L      F      A      A      E      E      I      K      K
2     A     D      L      O      A      B      E      F      H      F      A
3     D     T      I      D      A      A      E      E      I      K      K
4     A     P      F      J      A      A      D      E      K      G      B

  cat109 cat110 cat111 cat112 cat113 cat114 cat115 cat116     cont1     cont2  \
0     BU     BC      C     AS      S      A      O     LB  0.726300  0.245921
1     BI     CQ      A     AV     BM      A      O     DP  0.330514  0.737068
2     AB     DK      A      C     AF      A      I     GK  0.261841  0.358319
3     BI     CS      C      N     AE      A      O     DJ  0.321594  0.555782
4      H      C      C      Y     BM      A      K     CK  0.273204  0.159990

      cont3     cont4     cont5     cont6     cont7    cont8    cont9  \
0  0.187583  0.789639  0.310061  0.718367  0.335060  0.30260  0.67135
1  0.592681  0.614134  0.885834  0.438917  0.436585  0.60087  0.35127
2  0.484196  0.236924  0.397069  0.289648  0.315545  0.27320  0.26076
3  0.527991  0.373816  0.422268  0.440945  0.391128  0.31796  0.32128
4  0.527991  0.473202  0.704268  0.178193  0.247408  0.24564  0.22089

    cont10    cont11    cont12    cont13    cont14     loss
0  0.83510  0.569745  0.594646  0.822493  0.714843  2213.18
1  0.43919  0.338312  0.366307  0.611431  0.304496  1283.60
2  0.32446  0.381398  0.373424  0.195709  0.774425  3005.09
3  0.44467  0.327915  0.321570  0.605077  0.602642   939.85
4  0.21230  0.204687  0.202213  0.246011  0.432606  2763.85

**************************************
TEST DATA
**************************************
   id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13  \
0   4    A    B    A    A    A    A    A    A    B     A     B     A     A
1   6    A    B    A    B    A    A    A    A    B     A     A     A     A
2   9    A    B    A    B    B    A    B    A    B     B     A     B     B
3  12    A    A    A    A    B    A    A    A    A     A     A     A     A
4  15    B    A    A    A    A    B    A    A    A     A     A     A     A

  cat14 cat15 cat16 cat17 cat18 cat19 cat20 cat21 cat22 cat23 cat24 cat25  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     B     B     A
2     B     A     A     A     A     A     A     A     A     B     A     A
3     A     A     A     A     A     A     A     A     A     A     A     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat26 cat27 cat28 cat29 cat30 cat31 cat32 cat33 cat34 cat35 cat36 cat37  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     A     A     A     B     A
3     A     A     A     A     A     A     A     A     A     A     B     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat38 cat39 cat40 cat41 cat42 cat43 cat44 cat45 cat46 cat47 cat48 cat49  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     B     B     A     A     A     A     A     A     A     A     A     A
3     B     A     A     B     A     A     A     A     A     A     A     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat50 cat51 cat52 cat53 cat54 cat55 cat56 cat57 cat58 cat59 cat60 cat61  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     A     A
2     A     A     A     A     A     A     A     B     A     A     A     A
3     A     A     A     A     A     A     A     A     A     A     A     A
4     B     A     A     A     A     A     A     A     A     A     A     A

  cat62 cat63 cat64 cat65 cat66 cat67 cat68 cat69 cat70 cat71 cat72 cat73  \
0     A     A     A     A     A     A     A     A     A     A     A     A
1     A     A     A     A     A     A     A     A     A     A     B     A
2     A     A     A     A     A     A     A     A     A     A     A     A
3     A     A     A     A     A     A     A     A     A     A     B     A
4     A     A     A     A     A     A     A     A     A     A     A     A

  cat74 cat75 cat76 cat77 cat78 cat79 cat80 cat81 cat82 cat83 cat84 cat85  \
0     A     A     A     D     B     B     D     D     B     B     C     B
1     A     B     A     D     B     B     D     D     B     B     C     B
2     A     A     B     D     B     B     B     B     B     D     C     B
3     A     A     A     D     B     D     B     D     B     B     A     B
4     A     A     A     D     B     B     D     D     B     B     C     B

  cat86 cat87 cat88 cat89 cat90 cat91 cat92 cat93 cat94 cat95 cat96 cat97  \
0     D     B     A     A     A     A     A     D     C     C     E     C
1     B     B     A     A     A     A     A     D     D     D     E     A
2     B     B     A     B     A     A     A     D     D     C     E     E
3     D     D     A     A     A     G     H     D     D     C     E     E
4     B     B     A     A     A     A     A     D     B     D     E     A

  cat98 cat99 cat100 cat101 cat102 cat103 cat104 cat105 cat106 cat107 cat108  \
0     D     T      H      G      A      A      G      E      I      L      K
1     A     P      B      D      A      A      G      G      G      F      B
2     A     D      G      Q      A      D      D      E      J      G      A
3     D     T      G      A      A      D      E      E      I      K      K
4     A     P      A      A      A      A      F      E      G      E      B

  cat109 cat110 cat111 cat112 cat113 cat114 cat115 cat116     cont1     cont2  \
0     BI     BC      A      J     AX      A      Q     HG  0.321594  0.299102
1     BI     CO      E      G      X      A      L     HK  0.634734  0.620805
2     BI     CS      C      U     AE      A      K     CK  0.290813  0.737068
3     BI     CR      A     AY     AJ      A      P     DJ  0.268622  0.681761
4     AB     EG      A      E      I      C      J     HA  0.553846  0.299102

      cont3     cont4     cont5     cont6     cont7    cont8    cont9  \
0  0.246911  0.402922  0.281143  0.466591  0.317681  0.61229  0.34365
1  0.654310  0.946616  0.836443  0.482425  0.443760  0.71330  0.51890
2  0.711159  0.412789  0.718531  0.212308  0.325779  0.29758  0.34365
3  0.592681  0.354893  0.397069  0.369930  0.342355  0.40028  0.33237
4  0.263570  0.696873  0.302678  0.398862  0.391833  0.23688  0.43731

    cont10    cont11    cont12    cont13    cont14
0  0.38016  0.377724  0.369858  0.704052  0.392562
1  0.60401  0.689039  0.675759  0.453468  0.208045
2  0.30529  0.245410  0.241676  0.258586  0.297232
3  0.31480  0.348867  0.341872  0.592264  0.555955
4  0.50556  0.359572  0.352251  0.301535  0.825823
```
You might have noticed that we printed the same thing twice. The first time, pandas prints only a limited number of columns along with the first five observations, but the second time it prints all columns. This is because of:
```python
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
```
Make sure you keep the 5 in head(), else pandas will now print every row on screen, which will not be pretty. Take a look at the columns present in the train and test sets.
```python
print 'columns in train set : ', train.columns
print 'columns in test set : ', test.columns
```
There is an id column in both data sets which we don't need for any analysis. Moreover, we will move the loss column from the training set into a separate variable before dropping it.
```python
# remove the id column; it carries no predictive information
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# keep the target in a separate variable, then drop it from the training set
# (note: drop(..., inplace=True) returns None, so the target must be saved first)
loss = train['loss']
train.drop('loss', axis=1, inplace=True)
```
Let's take a look at the continuous variables and their basic statistics.
```python
# high-level statistics: count, mean, std, min/max and quartiles
# note - this works only for the continuous variables,
# not for the categorical variables
print train.describe()
print test.describe()
```
```
## train
               cont1          cont2          cont3          cont4  \
count  188318.000000  188318.000000  188318.000000  188318.000000
mean        0.493861       0.507188       0.498918       0.491812
std         0.187640       0.207202       0.202105       0.211292
min         0.000016       0.001149       0.002634       0.176921
25%         0.346090       0.358319       0.336963       0.327354
50%         0.475784       0.555782       0.527991       0.452887
75%         0.623912       0.681761       0.634224       0.652072
max         0.984975       0.862654       0.944251       0.954297

               cont5          cont6          cont7          cont8  \
count  188318.000000  188318.000000  188318.000000  188318.000000
mean        0.487428       0.490945       0.484970       0.486437
std         0.209027       0.205273       0.178450       0.199370
min         0.281143       0.012683       0.069503       0.236880
25%         0.281143       0.336105       0.350175       0.312800
50%         0.422268       0.440945       0.438285       0.441060
75%         0.643315       0.655021       0.591045       0.623580
max         0.983674       0.997162       1.000000       0.980200

               cont9         cont10         cont11         cont12  \
count  188318.000000  188318.000000  188318.000000  188318.000000
mean        0.485506       0.498066       0.493511       0.493150
std         0.181660       0.185877       0.209737       0.209427
min         0.000080       0.000000       0.035321       0.036232
25%         0.358970       0.364580       0.310961       0.311661
50%         0.441450       0.461190       0.457203       0.462286
75%         0.566820       0.614590       0.678924       0.675759
max         0.995400       0.994980       0.998742       0.998484

              cont13         cont14
count  188318.000000  188318.000000
mean        0.493138       0.495717
std         0.212777       0.222488
min         0.000228       0.179722
25%         0.315758       0.294610
50%         0.363547       0.407403
75%         0.689974       0.724623
max         0.988494       0.844848
```
In many competitions, you'll find that some features are present in the training set but not in the test set, and vice-versa.
```python
# at this point, it is wise to check whether there are any features that
# are in one of the datasets but not in the other
inTrainNotTest = []
for feature in train.columns:
    if feature not in test.columns:
        inTrainNotTest.append(feature)
if len(inTrainNotTest) > 0:
    print ', '.join(inTrainNotTest), ' features are present in training set but not in test set'

inTestNotTrain = []
for feature in test.columns:
    if feature not in train.columns:
        inTestNotTrain.append(feature)
if len(inTestNotTrain) > 0:
    print ', '.join(inTestNotTrain), ' features are present in test set but not in training set'
```
In this case, we see that there are no differing columns between the train and test sets.
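The same check can be written more compactly with set operations; a small sketch, equivalent to the loops above:

```python
# symmetric difference of the two column sets; an empty set means the
# train and test features match exactly
print set(train.columns) ^ set(test.columns)
```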
Let’s identify the categorical and continuous variables. For this data set, there are two ways to find them:
- the variable names contain 'cat' or 'cont', identifying their type
- pandas reads the categorical columns with data type object
```python
# find the categorical variables
# in this problem, categorical variables start with 'cat', which makes them
# easy to identify; in other problems it might not be like that
# we will see two ways to identify them in this problem
# we will also find the continuous/numerical variables

## 1. by name
categorical_train = [var for var in train.columns if 'cat' in var]
categorical_test = [var for var in test.columns if 'cat' in var]
continous_train = [var for var in train.columns if 'cont' in var]
continous_test = [var for var in test.columns if 'cont' in var]

## 2. by dtype == object
categorical_train = train.dtypes[train.dtypes == "object"].index
categorical_test = test.dtypes[test.dtypes == "object"].index
continous_train = train.dtypes[train.dtypes != "object"].index
continous_test = test.dtypes[test.dtypes != "object"].index
```
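Either way, a quick count confirms the two methods agree; a small sketch (the expected 116/14 split comes from the dataset description above):

```python
# expect 116 categorical and 14 continuous variables, as described earlier
print len(categorical_train), 'categorical |', len(continous_train), 'continuous'
```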
Correlation between continuous variables
Let's take a look at the correlation between the continuous variables. The idea is that when two variables are highly correlated, one of them can be removed.
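Before scanning the pairs numerically, the correlation matrix can also be eyeballed as a heatmap; a small sketch using the seaborn we already imported (the figure itself is not reproduced here):

```python
# visualize pairwise correlations of the continuous variables
correlation_train = train[continous_train].corr()
sns.heatmap(correlation_train, annot=True)
plt.show()
```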
```python
# let's check for correlation between the continuous variables
# correlation between two numerical variables means that as one variable
# increases, there is an almost consistent increase/decrease in the other;
# it varies from -1 to 1
correlation_train = train[continous_train].corr()
correlation_test = test[continous_test].corr()

# for the purpose of this analysis, we will consider two variables to be
# highly correlated if the correlation is more than 0.6
threshold = 0.6
for i in range(len(correlation_train)):
    for j in range(len(correlation_train)):
        if (i > j) and (correlation_train.iloc[i, j] > threshold):
            print ("%s and %s = %.2f" % (correlation_train.columns[i], correlation_train.columns[j], correlation_train.iloc[i, j]))

for i in range(len(correlation_test)):
    for j in range(len(correlation_test)):
        if (i > j) and (correlation_test.iloc[i, j] > threshold):
            print ("%s and %s = %.2f" % (correlation_test.columns[i], correlation_test.columns[j], correlation_test.iloc[i, j]))

# we can remove one of two highly correlated variables to improve performance
```
```
cont6 and cont1 = 0.76
cont7 and cont6 = 0.66
cont9 and cont1 = 0.93
cont9 and cont6 = 0.80
cont10 and cont1 = 0.81
cont10 and cont6 = 0.88
cont10 and cont9 = 0.79
cont11 and cont6 = 0.77
cont11 and cont7 = 0.75
cont11 and cont9 = 0.61
cont11 and cont10 = 0.70
cont12 and cont1 = 0.61
cont12 and cont6 = 0.79
cont12 and cont7 = 0.74
cont12 and cont9 = 0.63
cont12 and cont10 = 0.71
cont12 and cont11 = 0.99
cont13 and cont6 = 0.82
cont13 and cont9 = 0.64
cont13 and cont10 = 0.71
```
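Acting on this is a judgment call; a minimal sketch, dropping one illustrative variable from each of the two strongest pairs on copies, so the rest of this walkthrough keeps all variables:

```python
# e.g. cont12 ~ cont11 at 0.99 and cont9 ~ cont1 at 0.93 above; one of each
# pair could be dropped (done on copies here, the originals stay intact)
reduced_train = train.drop(['cont9', 'cont12'], axis=1)
reduced_test = test.drop(['cont9', 'cont12'], axis=1)
```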
Let's take a look at the labels present in the categorical variables. Although the columns themselves match, it may happen that some labels are present in one data set but not in the other.
```python
# let's check the factors in the categorical variables
for feature in categorical_train:
    print feature, 'has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique()

for feature in categorical_test:
    print feature, 'has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique()

# let's check whether any unique values/factors are absent from one of the
# datasets; for example, cat1 has only the values A & B in both datasets,
# but sometimes a new value is present in the test set, which may ruin your model
featuresDone = []
for feature in categorical_train:
    if feature in categorical_test:
        if set(train[feature].unique()) - set(test[feature].unique()) != set([]):
            print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
            print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
            print 'Missing values are : ', set(train[feature].unique()) - set(test[feature].unique())
        featuresDone.append(feature)

for feature in categorical_test:
    if (feature in categorical_train) and (feature not in featuresDone):
        if set(test[feature].unique()) - set(train[feature].unique()) != set([]):
            print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
            print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
            print 'Missing values are : ', set(test[feature].unique()) - set(train[feature].unique())
        featuresDone.append(feature)
```
```
cat1 has  2 values. Unique values are ::  ['A' 'B']
cat2 has  2 values. Unique values are ::  ['B' 'A']
cat3 has  2 values. Unique values are ::  ['A' 'B']
cat4 has  2 values. Unique values are ::  ['B' 'A']
cat5 has  2 values. Unique values are ::  ['A' 'B']
...
cat73 has  3 values. Unique values are ::  ['A' 'B' 'C']
...
cat88 has  4 values. Unique values are ::  ['A' 'D' 'E' 'B']
cat89 has  8 values. Unique values are ::  ['A' 'B' 'C' 'E' 'D' 'H' 'I' 'G']
...
cat99 has  16 values. Unique values are ::  ['T' 'D' 'P' 'S' 'R' 'K' 'E' 'F' 'N' 'J' 'C' 'M' 'H' 'G' 'I' 'O']
...
cat109 has  84 values. Unique values are ::  ['BU' 'BI' 'AB' 'H' 'K' 'CD' ...]
cat110 has  131 values. Unique values are ::  ['BC' 'CQ' 'DK' 'CS' 'C' 'EB' ...]
cat111 has  16 values. Unique values are ::  ['C' 'A' 'G' 'E' 'I' 'M' 'W' 'S' 'K' 'O' 'Q' 'U' 'F' 'B' 'Y' 'D']
cat112 has  51 values. Unique values are ::  ['AS' 'AV' 'C' 'N' 'Y' 'J' ...]
cat113 has  61 values. Unique values are ::  ['S' 'BM' 'AF' 'AE' 'Y' 'AX' ...]
cat114 has  19 values. Unique values are ::  ['A' 'J' 'E' 'C' 'F' 'L' 'N' 'I' 'R' 'U' 'O' 'B' 'Q' 'V' 'D' 'X' 'W' 'S' 'G']
cat115 has  23 values. Unique values are ::  ['O' 'I' 'K' 'P' 'Q' 'L' 'J' 'R' 'N' 'M' 'H' 'G' 'F' 'A' 'S' 'W' 'T' 'C' 'E' 'D' 'B' 'X' 'U']
cat116 has  326 values. Unique values are ::  ['LB' 'DP' 'GK' 'DJ' 'CK' 'LO' ...]
cat1 has  2 values. Unique values are ::  ['A' 'B']
cat2 has  2 values. Unique values are ::  ['B' 'A']
...
cat109 has  74 values. Unique values are ::  ['BI' 'AB' 'K' 'G' 'BU' 'M' ...]
cat110 has  123 values. Unique values are ::  ['BC' 'CO' 'CS' 'CR' 'EG' 'CL' ...]
...
cat116 has  311 values. Unique values are ::  ['HG' 'HK' 'CK' 'DJ' 'HA' 'HY' ...]
Train set has  8 values. Unique values are ::  ['A' 'B' 'C' 'E' 'D' 'H' 'I' 'G']
test set has  8 values. Unique values are ::  ['A' 'B' 'D' 'C' 'F' 'H' 'E' 'G']
Missing values are :  set(['I'])
Train set has  7 values. Unique values are ::  ['A' 'B' 'C' 'D' 'F' 'E' 'G']
test set has  6 values. Unique values are ::  ['A' 'B' 'C' 'D' 'F' 'E']
Missing values are :  set(['G'])
Train set has  7 values. Unique values are ::  ['A' 'H' 'B' 'C' 'D' 'I' 'F']
test set has  8 values. Unique values are ::  ['A' 'H' 'B' 'C' 'G' 'I' 'D' 'E']
Missing values are :  set(['F'])
Train set has  19 values. Unique values are ::  ['G' 'F' 'O' 'D' 'J' 'A' 'C' 'Q' 'M' 'I' 'L' 'R' 'S' 'E' 'N' 'H' 'B' 'U' 'K']
test set has  17 values. Unique values are ::  ['G' 'D' 'Q' 'A' 'F' 'M' 'L' 'O' 'C' 'I' 'J' 'S' 'R' 'E' 'B' 'H' 'K']
Missing values are :  set(['U', 'N'])
Train set has  9 values. Unique values are ::  ['A' 'C' 'B' 'D' 'G' 'E' 'F' 'H' 'J']
test set has  7 values. Unique values are ::  ['A' 'C' 'B' 'E' 'D' 'G' 'F']
Missing values are :  set(['H', 'J'])
Train set has  20 values. Unique values are ::  ['E' 'F' 'H' 'G' 'I' 'D' 'J' 'K' 'M' 'C' 'A' 'L' 'N' 'P' 'T' 'Q' 'R' 'O' 'B' 'S']
test set has  18 values. Unique values are ::  ['E' 'G' 'F' 'H' 'I' 'D' 'J' 'A' 'L' 'C' 'K' 'N' 'M' 'P' 'O' 'T' 'B' 'Q']
Missing values are :  set(['S', 'R'])
Train set has  84 values. Unique values are ::  ['BU' 'BI' 'AB' 'H' 'K' 'CD' ...]
test set has  74 values. Unique values are ::  ['BI' 'AB' 'K' 'G' 'BU' 'M' ...]
Missing values are :  set(['CJ', 'BF', 'B', 'AG', 'BM', 'AK', 'J', 'BT', 'BV', 'BP', 'BY'])
Train set has  131 values. Unique values are ::  ['BC' 'CQ' 'DK' 'CS' 'C' 'EB' ...]
test set has  123 values. Unique values are ::  ['BC' 'CO' 'CS' 'CR' 'EG' 'CL' ...]
Missing values are :  set(['BD', 'EI', 'EH', 'AF', 'H', 'BN', 'BI', 'BK', 'CB', 'DV', 'AN'])
Train set has  16 values. Unique values are ::  ['C' 'A' 'G' 'E' 'I' 'M' 'W' 'S' 'K' 'O' 'Q' 'U' 'F' 'B' 'Y' 'D']
test set has  16 values. Unique values are ::  ['A' 'E' 'C' 'G' 'K' 'I' 'Q' 'U' 'M' 'O' 'S' 'F' 'L' 'W' 'Y' 'B']
Missing values are :  set(['D'])
Train set has  61 values. Unique values are ::  ['S' 'BM' 'AF' 'AE' 'Y' 'AX' ...]
test set has  60 values. Unique values are ::  ['AX' 'X' 'AE' 'AJ' 'I' 'BC' ...]
Missing values are :  set(['BE', 'AC', 'T'])
Train set has  19 values. Unique values are ::  ['A' 'J' 'E' 'C' 'F' 'L' 'N' 'I' 'R' 'U' 'O' 'B' 'Q' 'V' 'D' 'X' 'W' 'S' 'G']
test set has  18 values. Unique values are ::  ['A' 'C' 'E' 'N' 'I' 'O' 'F' 'J' 'R' 'L' 'U' 'V' 'Q' 'B' 'W' 'G' 'D' 'S']
Missing values are :  set(['X'])
Train set has  326 values. Unique values are ::  ['LB' 'DP' 'GK' 'DJ' 'CK' 'LO' ...]
test set has  311 values. Unique values are ::  ['HG' 'HK' 'CK' 'DJ' 'HA' 'HY' ...]
Missing values are :  set(['BF', 'FS', 'W', 'BL', 'BI', 'HU', 'JN', 'JO', 'JI', 'DY', 'JD', 'FN', 'FO', 'IB', 'JT', 'DQ', 'IX', 'C', 'AB', 'GQ', 'CC', 'AH', 'AM', 'AP', 'AS', 'AT', 'IO', 'V', 'AY', 'X', 'EV', 'EQ', 'MF', 'MB', 'IK', 'MT', 'P', 'HO'])
```
Let’s plot the categorical variables to see how their values are distributed.
# let's visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# it's better to use an IPython/Jupyter notebook so the plots render inline
for feature in categorical_train:
    sns.countplot(x=train[feature], data=train)
    plt.show()

for feature in categorical_test:
    sns.countplot(x=test[feature], data=test)
    plt.show()
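With 116 categorical features, clicking through plt.show() windows gets tedious outside a notebook. One alternative is to save each plot to disk instead; here is a minimal sketch (the 'plots' directory and the filenames are my own choices, not from the original code):

import os

# save each count plot as a PNG instead of blocking on plt.show()
if not os.path.exists('plots'):
    os.makedirs('plots')

for feature in categorical_train:
    sns.countplot(x=train[feature], data=train)
    plt.savefig(os.path.join('plots', feature + '.png'))
    plt.close()   # free the figure before drawing the next one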
One-hot encoding of categorical variables
One-hot (aka one-of-K) encoding turns categorical features into binary columns. The input is a set of categorical (discrete) feature values; the output is a sparse matrix in which each column corresponds to one possible value of one feature.
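To see what this produces, here is a minimal sketch on toy data (the 'color' feature and its values are made up, not from the competition), using pandas for brevity:

import pandas as pd

# toy frame: one categorical feature with three levels
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# one column per level; each row has a single 1 marking its level
# (depending on your pandas version the cells may print as 0/1 or booleans)
print pd.get_dummies(toy['color'])
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      0    1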
1. The first way is to use DictVectorizer to encode the labels in each feature.
# cat1 to cat72 have only two labels, A and B
# cat73 to cat108 have more than two labels
# cat109 to cat116 have many labels
# moreover, you may have noticed that some labels appear in only one of the
# train/test datasets; encoding the sets separately would then produce
# mismatched columns, so we merge the data before one-hot encoding
train_test = pd.concat([train, test]).reset_index(drop=True)
categorical = train_test.dtypes[train_test.dtypes == "object"].index

# let's check the factors in the categorical variables
for feature in categorical:
    print feature, 'has', len(train_test[feature].unique()), 'values. Unique values are ::', train_test[feature].unique()

# 1. one-hot encoding all categorical variables
v = DictVectorizer()
train_test_qual = v.fit_transform(train_test[categorical].to_dict('records'))
# note: vocabulary_ is an attribute of the vectorizer, not of the sparse matrix
print 'total vocabulary ::', v.vocabulary_
print 'total number of columns', len(v.vocabulary_.keys())
print 'total number of new columns added', len(v.vocabulary_.keys()) - len(categorical)

# we are clearly adding a lot of new variables; the encoding is still necessary,
# since machine learning algorithms don't understand strings, and converting
# string factors to numeric ones increases dimensionality
new_df = pd.DataFrame(train_test_qual.toarray(),
                      columns=[i[0] for i in sorted(v.vocabulary_.items(), key=operator.itemgetter(1))])
new_df = pd.concat([new_df, train_test], axis=1)

# remove the original categorical variables
new_df.drop(categorical, axis=1, inplace=True)

# split back into train and test sets
train_featured = new_df.iloc[:train.shape[0], :]
test_featured = new_df.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured[continous_train] = train[continous_train]
test_featured[continous_train] = test[continous_train]
train_featured['loss'] = loss
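If the mechanics of DictVectorizer are unclear, here is a small self-contained sketch on toy records (not the competition data). It also shows why vocabulary_ must be read off the vectorizer itself and not off the sparse matrix it returns:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
records = [{'cat1': 'A', 'cat2': 'B'},
           {'cat1': 'B', 'cat2': 'C'}]

# each (feature, string value) pair becomes its own binary column
X = v.fit_transform(records)
print v.vocabulary_    # {'cat1=A': 0, 'cat1=B': 1, 'cat2=B': 2, 'cat2=C': 3}
print X.toarray()
# [[ 1.  0.  1.  0.]
#  [ 0.  1.  0.  1.]]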
2. The second method is to use pandas get_dummies to create dummy variables.
# 2. using get_dummies from pandas
new_df2 = train_test.copy()   # copy so we don't mutate the merged frame
dummies = pd.get_dummies(train_test[categorical], drop_first=True)
new_df2 = pd.concat([new_df2, dummies], axis=1)
new_df2.drop(categorical, inplace=True, axis=1)

# split back into train and test sets
train_featured2 = new_df2.iloc[:train.shape[0], :]
test_featured2 = new_df2.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured2[continous_train] = train[continous_train]
test_featured2[continous_train] = test[continous_train]
train_featured2['loss'] = loss
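A quick illustration of what drop_first=True does, on made-up data: with k levels you get k-1 dummy columns, since the dropped level is implied when all the dummies are 0.

import pandas as pd

toy = pd.DataFrame({'cat1': ['A', 'B', 'A'],
                    'cat2': ['X', 'Y', 'Z']})

# 'cat1_A' and 'cat2_X' are dropped; a row of all zeros implies those levels
print pd.get_dummies(toy, drop_first=True)
#    cat1_B  cat2_Y  cat2_Z
# 0       0       0       0
# 1       1       1       0
# 2       0       0       1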
3. Some of these variables have only two labels and some have more. Another option is pd.factorize, which converts the labels to numeric codes.
# 3. pd.factorize
new_df3 = train_test.copy()   # copy so we don't mutate the merged frame
# factorize only the categorical columns; the continuous ones are already numeric
for feature in categorical:
    new_df3[feature] = pd.factorize(new_df3[feature], sort=True)[0]

# split back into train and test sets
train_featured3 = new_df3.iloc[:train.shape[0], :]
test_featured3 = new_df3.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured3[continous_train] = train[continous_train]
test_featured3[continous_train] = test[continous_train]
train_featured3['loss'] = loss
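Here is a tiny sketch of pd.factorize on toy values: with sort=True, the codes follow the sorted order of the unique labels, so the mapping is stable across runs.

import pandas as pd

codes, uniques = pd.factorize(['B', 'A', 'C', 'A'], sort=True)
print codes     # [1 0 2 0] -- one integer code per value
print uniques   # ['A' 'B' 'C'] -- code i corresponds to uniques[i]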
4. Another way is to mix dummy variables and factorization.
# 4. mixed approach
# cat1 to cat72 have just two labels, so we can factorize those;
# for the rest we create dummies
new_df4 = train_test.copy()   # copy so we don't mutate the merged frame
for feature in categorical[:72]:
    new_df4[feature] = pd.factorize(new_df4[feature], sort=True)[0]
# note: categorical[72:] (not [73:]) covers cat73 onwards, since indexing is zero-based
dummies = pd.get_dummies(train_test[categorical[72:]], drop_first=True)
new_df4 = pd.concat([new_df4, dummies], axis=1)
new_df4.drop(categorical[72:], inplace=True, axis=1)

# split back into train and test sets
train_featured4 = new_df4.iloc[:train.shape[0], :]
test_featured4 = new_df4.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured4[continous_train] = train[continous_train]
test_featured4[continous_train] = test[continous_train]
train_featured4['loss'] = loss
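Why does factorizing the binary features lose nothing? For a two-level feature, the factorized codes carry exactly the same information as the single dummy column that get_dummies with drop_first=True would keep. A toy check (illustrative values):

import pandas as pd

s = pd.Series(['A', 'B', 'B', 'A'])

# factorize gives 0/1 codes...
print pd.factorize(s, sort=True)[0]                       # [0 1 1 0]

# ...and get_dummies with drop_first keeps one column with the same pattern
# (may print as booleans on newer pandas versions)
print pd.get_dummies(s, drop_first=True).values.ravel()   # [0 1 1 0]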
Here’s the full code
# import required libraries
# pandas for reading data and manipulation
# scikit-learn for the one-hot encoder (DictVectorizer)
# sns and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator

# read data from csv files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)

# the above code won't print all columns
# to print all columns:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)

# remove the id column; it carries no signal
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# save the target before dropping it
# (note: drop(..., inplace=True) returns None, so grab the column first)
loss = train['loss']
train.drop('loss', axis=1, inplace=True)

# high-level statistics: count, mean, quartiles, etc.
# note - this works only for the continuous variables,
# not for the categorical ones
print train.describe()
print test.describe()

# at this point it is wise to check whether any features
# are present in one dataset but not the other
missingFeatures = False
inTrainNotTest = []
for feature in train.columns:
    if feature not in test.columns:
        missingFeatures = True
        inTrainNotTest.append(feature)
if len(inTrainNotTest) > 0:
    print ', '.join(inTrainNotTest), 'features are present in the training set but not in the test set'

inTestNotTrain = []
for feature in test.columns:
    if feature not in train.columns:
        missingFeatures = True
        inTestNotTrain.append(feature)
if len(inTestNotTrain) > 0:
    print ', '.join(inTestNotTrain), 'features are present in the test set but not in the training set'

# find the categorical variables
# in this problem they conveniently start with 'cat', which makes them
# easy to identify; other problems might not be so tidy,
# so here are two ways to find them
# we also collect the continuous (numerical) variables

## 1. by name
categorical_train = [var for var in train.columns if 'cat' in var]
categorical_test = [var for var in test.columns if 'cat' in var]
continous_train = [var for var in train.columns if 'cont' in var]
continous_test = [var for var in test.columns if 'cont' in var]

## 2. by dtype == object
categorical_train = train.dtypes[train.dtypes == "object"].index
categorical_test = test.dtypes[test.dtypes == "object"].index
continous_train = train.dtypes[train.dtypes != "object"].index
continous_test = test.dtypes[test.dtypes != "object"].index

# let's check for correlation between the continuous variables
# correlation means that when one variable increases, the other shows a
# consistent increase or decrease; it ranges from -1 to 1
correlation_train = train[continous_train].corr()
correlation_test = test[continous_test].corr()

# for the purpose of this analysis, we will consider two variables to be
# highly correlated if their correlation exceeds 0.6
threshold = 0.6
for i in range(len(correlation_train)):
    for j in range(len(correlation_train)):
        if (i > j) and (correlation_train.iloc[i, j] > threshold):
            print ("%s and %s = %.2f" % (correlation_train.columns[i], correlation_train.columns[j], correlation_train.iloc[i, j]))
for i in range(len(correlation_test)):
    for j in range(len(correlation_test)):
        if (i > j) and (correlation_test.iloc[i, j] > threshold):
            print ("%s and %s = %.2f" % (correlation_test.columns[i], correlation_test.columns[j], correlation_test.iloc[i, j]))

#### cont6 and cont1 = 0.76
#### cont7 and cont6 = 0.66
#### cont9 and cont1 = 0.93
#### cont9 and cont6 = 0.80
#### cont10 and cont1 = 0.81
#### cont10 and cont6 = 0.88
#### cont10 and cont9 = 0.79
#### cont11 and cont6 = 0.77
#### cont11 and cont7 = 0.75
#### cont11 and cont9 = 0.61
#### cont11 and cont10 = 0.70
#### cont12 and cont1 = 0.61
#### cont12 and cont6 = 0.79
#### cont12 and cont7 = 0.74
#### cont12 and cont9 = 0.63
#### cont12 and cont10 = 0.71
#### cont12 and cont11 = 0.99
#### cont13 and cont6 = 0.82
#### cont13 and cont9 = 0.64
#### cont13 and cont10 = 0.71

# we can drop one variable from each highly correlated pair to improve performance

# let's check the factors in the categorical variables
for feature in categorical_train:
    print feature, 'has', len(train[feature].unique()), 'values. Unique values are ::', train[feature].unique()
for feature in categorical_test:
    print feature, 'has', len(test[feature].unique()), 'values. Unique values are ::', test[feature].unique()

# now check whether every unique value/factor appears in both datasets
# for example, cat1 has only A and B in both; but sometimes a value appears
# in only one of the sets, which can ruin your model
# (check both directions: values in train missing from test, and vice versa)
for feature in categorical_train:
    if feature in categorical_test:
        missing = set(train[feature].unique()) - set(test[feature].unique())
        if missing != set([]):
            print 'Train set has', len(train[feature].unique()), 'values. Unique values are ::', train[feature].unique(), '\n'
            print 'Test set has', len(test[feature].unique()), 'values. Unique values are ::', test[feature].unique(), '\n'
            print 'Values missing from the test set :', missing
for feature in categorical_test:
    if feature in categorical_train:
        missing = set(test[feature].unique()) - set(train[feature].unique())
        if missing != set([]):
            print 'Values of', feature, 'missing from the training set :', missing

# let's visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# it's better to use an IPython/Jupyter notebook so the plots render inline
for feature in categorical_train:
    sns.countplot(x=train[feature], data=train)
    #plt.show()   # uncomment to display each plot
for feature in categorical_test:
    sns.countplot(x=test[feature], data=test)
    #plt.show()   # uncomment to display each plot

# cat1 to cat72 have only two labels, A and B
# cat73 to cat108 have more than two labels
# cat109 to cat116 have many labels
# some labels appear in only one of the train/test datasets; encoding the
# sets separately would produce mismatched columns, so merge before encoding
train_test = pd.concat([train, test]).reset_index(drop=True)
categorical = train_test.dtypes[train_test.dtypes == "object"].index

# let's check the factors in the categorical variables
for feature in categorical:
    print feature, 'has', len(train_test[feature].unique()), 'values. Unique values are ::', train_test[feature].unique()

# 1. one-hot encoding all categorical variables
v = DictVectorizer()
train_test_qual = v.fit_transform(train_test[categorical].to_dict('records'))
# note: vocabulary_ is an attribute of the vectorizer, not of the sparse matrix
print 'total vocabulary ::', v.vocabulary_
print 'total number of columns', len(v.vocabulary_.keys())
print 'total number of new columns added', len(v.vocabulary_.keys()) - len(categorical)

# we are adding a lot of new variables; the encoding is still necessary,
# since machine learning algorithms don't understand strings, and converting
# string factors to numeric ones increases dimensionality
new_df = pd.DataFrame(train_test_qual.toarray(),
                      columns=[i[0] for i in sorted(v.vocabulary_.items(), key=operator.itemgetter(1))])
new_df = pd.concat([new_df, train_test], axis=1)

# remove the original categorical variables
new_df.drop(categorical, axis=1, inplace=True)

# split back into train and test sets
train_featured = new_df.iloc[:train.shape[0], :]
test_featured = new_df.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured[continous_train] = train[continous_train]
test_featured[continous_train] = test[continous_train]
train_featured['loss'] = loss

# 2. using get_dummies from pandas
new_df2 = train_test.copy()   # copy so we don't mutate the merged frame
dummies = pd.get_dummies(train_test[categorical], drop_first=True)
new_df2 = pd.concat([new_df2, dummies], axis=1)
new_df2.drop(categorical, inplace=True, axis=1)

# split back into train and test sets
train_featured2 = new_df2.iloc[:train.shape[0], :]
test_featured2 = new_df2.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured2[continous_train] = train[continous_train]
test_featured2[continous_train] = test[continous_train]
train_featured2['loss'] = loss

# 3. pd.factorize
new_df3 = train_test.copy()   # copy so we don't mutate the merged frame
# factorize only the categorical columns; the continuous ones are already numeric
for feature in categorical:
    new_df3[feature] = pd.factorize(new_df3[feature], sort=True)[0]

# split back into train and test sets
train_featured3 = new_df3.iloc[:train.shape[0], :]
test_featured3 = new_df3.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured3[continous_train] = train[continous_train]
test_featured3[continous_train] = test[continous_train]
train_featured3['loss'] = loss

# 4. mixed approach
# cat1 to cat72 have just two labels, so factorize those;
# for the rest, create dummies
new_df4 = train_test.copy()   # copy so we don't mutate the merged frame
for feature in categorical[:72]:
    new_df4[feature] = pd.factorize(new_df4[feature], sort=True)[0]
# categorical[72:] (not [73:]) covers cat73 onwards, since indexing is zero-based
dummies = pd.get_dummies(train_test[categorical[72:]], drop_first=True)
new_df4 = pd.concat([new_df4, dummies], axis=1)
new_df4.drop(categorical[72:], inplace=True, axis=1)

# split back into train and test sets
train_featured4 = new_df4.iloc[:train.shape[0], :]
test_featured4 = new_df4.iloc[train.shape[0]:, :].reset_index(drop=True)
train_featured4[continous_train] = train[continous_train]
test_featured4[continous_train] = test[continous_train]
train_featured4['loss'] = loss

## any of these feature sets can now be used for training and testing a model
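Any of these four feature sets can be fed straight into a scikit-learn estimator. A minimal sketch, assuming the frames built above and using RandomForestRegressor purely as an illustrative choice:

from sklearn.ensemble import RandomForestRegressor

# separate features and target from the second feature set
X = train_featured2.drop('loss', axis=1)
y = train_featured2['loss']

# an illustrative model; any regressor would slot in here
model = RandomForestRegressor(n_estimators=50)
model.fit(X, y)
predictions = model.predict(test_featured2)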
Happy Python-ing!