Data conversion – the first step towards data processing

Convert all strings to integers, ranging from 0 to n.
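As a minimal sketch of the idea, a string can be encoded as its position in a list of allowed values. The three-element list and the sample inputs here are illustrative only, and putting '?' (a missing value) at position 0 is an assumption of this sketch.

```python
# Illustrative value list for one attribute; '?' is assumed to mark a missing value.
values = ['?', 'Female', 'Male']

# Map each string to its position in the list: '?' -> 0, 'Female' -> 1, 'Male' -> 2.
code_of = {value: index for index, value in enumerate(values)}

# Encode a few made-up sample strings.
encoded = [code_of[v] for v in ['Male', 'Female', '?', 'Male']]
print(encoded)  # [2, 1, 0, 2]
```

With n listed values plus the missing-value marker, the codes run from 0 to n.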

 

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
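Turning a comma-separated value line like the ones above into a quoted Python list is mechanical. The following is only a sketch of how such a quoting helper might look, not the author's actual program: the function name `quote_values` and the `missing` parameter are hypothetical, and placing '?' first is an assumption.

```python
def quote_values(line, missing='?'):
    """Turn 'Female, Male.' into the source text of a Python list of strings."""
    # Drop the trailing period, split on commas, and trim the spaces.
    names = [name.strip() for name in line.strip().rstrip('.').split(',')]
    # Put the missing-value marker first so that it gets index 0.
    names = [missing] + [name for name in names if name]
    # repr() adds the quotation marks around each string.
    return '[' + ', '.join(repr(name) for name in names) + ']'

print('sex = ' + quote_values('Female, Male.'))
# sex = ['?', 'Female', 'Male']
```

The printed line can be pasted directly into a Python source file.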

 

I used a Python program to do the conversion. While writing the code, especially the lists, I found that adding the quotation marks by hand was a waste of time, so I wrote a program to add them for me.

The conversion program:

import time

# start timing
t1 = time.time()

# define the lists used for conversion; the index of a string in its list
# is its integer code, and index 0 ('?') marks a missing value
workclass = ['?', 'Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']

education = ['?','Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']

marital_status = ['?','Married-civ-spouse','Divorced','Never-married','Separated','Widowed','Married-spouse-absent','Married-AF-spouse']

occupation = ['?','Tech-support','Craft-repair','Other-service','Sales','Exec-managerial','Prof-specialty','Handlers-cleaners','Machine-op-inspct','Adm-clerical','Farming-fishing','Transport-moving','Priv-house-serv','Protective-serv','Armed-Forces']

relationship = ['?','Wife','Own-child','Husband','Not-in-family','Other-relative','Unmarried']

race = ['?','White','Asian-Pac-Islander','Amer-Indian-Eskimo','Other','Black']

sex = ['?','Female','Male']

native_country = ['?','United-States','Cambodia','England','Puerto-Rico','Canada','Germany','Outlying-US(Guam-USVI-etc)','India','Japan','Greece','South','China','Cuba','Iran','Honduras','Philippines','Italy','Poland','Jamaica','Vietnam','Mexico','Portugal','Ireland','France','Dominican-Republic','Laos','Ecuador','Taiwan','Haiti','Columbia','Hungary','Guatemala','Nicaragua','Scotland','Thailand','Yugoslavia','El-Salvador','Trinadad&Tobago','Peru','Hong','Holand-Netherlands']

is_over_50k = ['?','>50K', '<=50K']

# collect the lists into a two-dimensional list
items = [workclass, education, marital_status, occupation, relationship, race, sex, native_country, is_over_50k]

# open the files; the with statement closes both of them at the end
with open('../resource/adult.data', 'r') as filereader, \
        open('../resource/converted_data.data', 'w') as filewriter:

    # read the file line by line
    for eachline in filereader:

        # iterate over the lists
        for item in items:

            # replace each string with its index in the list
            # note: replace() matches substrings anywhere in the line, so the
            # order of the lists and of their elements matters
            for count, element in enumerate(item):
                eachline = eachline.replace(element, str(count))

        # write the converted line to the output file
        filewriter.write(eachline)

# end timing
t2 = time.time()

print('done')
print(str(t2 - t1))
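
The program above replaces substrings anywhere in the line, so it relies on the order of the lists: for example, 'India' also occurs inside 'Amer-Indian-Eskimo', which is only safe because race is converted before native-country. A possible alternative, sketched here under assumptions, is to split each line into fields and convert each column with its own list. The function name `encode_row`, the shortened lists, and the four-field sample row are hypothetical and only serve the illustration.

```python
# Shortened lookup lists for the sketch; index 0 ('?') marks a missing value.
sex = ['?', 'Female', 'Male']
race = ['?', 'White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']

# Column position in a (made-up, shortened) row -> lookup list for that column.
columns = {1: race, 2: sex}

def encode_row(line, columns):
    """Replace the categorical fields of one comma-separated row with integers."""
    fields = [field.strip() for field in line.split(',')]
    for position, values in columns.items():
        # list.index() matches the whole field, never a substring of it.
        fields[position] = str(values.index(fields[position]))
    return ', '.join(fields)

print(encode_row('39, White, Male, 40', columns))  # 39, 1, 2, 40
```

Because each field is looked up only in the list of its own column, the result no longer depends on the order in which the lists are processed; an unknown string raises a ValueError instead of passing through silently.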
posted on 2012-07-09 20:38 by Jiang, X.