Pandas数据清洗
Pandas数据清洗
import pandas as pd
import numpy as np
df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range('20130102', periods=6),
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],
"price":[1200,np.nan,2133,5433,np.nan,4432]},
columns =['id','date','city','category','age','price'])
df
| id | date | city | category | age | price | |
|---|---|---|---|---|---|---|
| 0 | 1001 | 2013-01-02 | Beijing | 100-A | 23 | 1200.0 |
| 1 | 1002 | 2013-01-03 | SH | 100-B | 44 | NaN |
| 2 | 1003 | 2013-01-04 | guangzhou | 110-A | 54 | 2133.0 |
| 3 | 1004 | 2013-01-05 | Shenzhen | 110-C | 32 | 5433.0 |
| 4 | 1005 | 2013-01-06 | shanghai | 210-A | 34 | NaN |
| 5 | 1006 | 2013-01-07 | BEIJING | 130-F | 32 | 4432.0 |
1. 用数字0替换缺失值
df['price'].fillna(0,inplace=True)
inplace=True表示在原数据上进行修改,默认是False,即不修改原数据,而是返回一个新的对象
df['price'].fillna(0)
0 1200.0
1 0.0
2 2133.0
3 5433.0
4 0.0
5 4432.0
Name: price, dtype: float64
2. 用均值替换缺失值
df['price'].fillna(df['price'].mean(),inplace=True)
df['price'].fillna(df['price'].mean())
0 1200.0
1 3299.5
2 2133.0
3 5433.0
4 3299.5
5 4432.0
Name: price, dtype: float64
3. 清除city字段的字符空格
df['city'] = df['city'].map(str.strip)
map()函数可以对数据框的每一个元素进行操作,str.strip表示去掉字符串的前后空格
df['city'] = df['city'].map(str.strip)
df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
4. 大小写转换
df['city'] = df['city'].map(str.lower) # 转换为小写
df['city'] = df['city'].str.lower() # 转换为小写
df['city'] = df['city'].map(str.upper) # 转换为大写
df['city'] = df['city'].str.upper() # 转换为大写
df['city'] = df['city'].str.upper()
df['city']
0 BEIJING
1 SH
2 GUANGZHOU
3 SHENZHEN
4 SHANGHAI
5 BEIJING
Name: city, dtype: object
df['city'] = df['city'].map(str.lower)
df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
5. 更改数据格式
df['price'] = df['price'].astype('int') # 转换为整数
df['price'] = df['price'].astype('float') # 转换为浮点数
6. 更改列名
df.rename(columns={'old_name':'new_name'}, inplace=True)
inplace=True表示在原数据上进行修改,默认是False,即不修改原数据,而是返回一个新的对象
df.rename(columns = {"category" : "category-size"})
| id | date | city | category-size | age | price | |
|---|---|---|---|---|---|---|
| 0 | 1001 | 2013-01-02 | beijing | 100-A | 23 | 1200.0 |
| 1 | 1002 | 2013-01-03 | sh | 100-B | 44 | NaN |
| 2 | 1003 | 2013-01-04 | guangzhou | 110-A | 54 | 2133.0 |
| 3 | 1004 | 2013-01-05 | shenzhen | 110-C | 32 | 5433.0 |
| 4 | 1005 | 2013-01-06 | shanghai | 210-A | 34 | NaN |
| 5 | 1006 | 2013-01-07 | beijing | 130-F | 32 | 4432.0 |
7. 删除后出现的重复值
df.drop_duplicates(inplace=True, keep='first')
keep参数表示保留哪一个重复值,first表示保留第一个,last表示保留最后一个,False表示删除所有重复值, 默认是first
df['city'].drop_duplicates()
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
Name: city, dtype: object
8. 数据替换
df['city'].replace('New York', 'NYC', inplace=True)
df['city'].replace('sh', 'shanghai')
0 beijing
1 shanghai
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object

浙公网安备 33010602011771号