Title

Pandas数据清洗

Pandas数据清洗

import pandas as pd
import numpy as np
df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], 
 "date":pd.date_range('20130102', periods=6),
  "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
 "age":[23,44,54,32,34,32],
 "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
  "price":[1200,np.nan,2133,5433,np.nan,4432]},
  columns =['id','date','city','category','age','price'])
df
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0

1. 用数字0替换缺失值

df['price'].fillna(0,inplace=True)

inplace=True表示在原数据上进行修改,默认是False,即不修改原数据,而是返回一个新的对象

df['price'].fillna(0)
0    1200.0
1       0.0
2    2133.0
3    5433.0
4       0.0
5    4432.0
Name: price, dtype: float64

2. 用均值替换缺失值

df['price'].fillna(df['price'].mean(),inplace=True)
df['price'].fillna(df['price'].mean())
0    1200.0
1    3299.5
2    2133.0
3    5433.0
4    3299.5
5    4432.0
Name: price, dtype: float64

3. 清除city字段的字符空格

df['city'] = df['city'].map(str.strip)

map()函数可以对数据框的每一个元素进行操作,str.strip表示去掉字符串的前后空格

df['city'] = df['city'].map(str.strip)
df['city']
0      Beijing
1           SH
2    guangzhou
3     Shenzhen
4     shanghai
5      BEIJING
Name: city, dtype: object

4. 大小写转换

df['city'] = df['city'].map(str.lower)  # 转换为小写
df['city'] = df['city'].str.lower()  # 转换为小写
df['city'] = df['city'].map(str.upper)  # 转换为大写
df['city'] = df['city'].str.upper()  # 转换为大写
df['city'] = df['city'].str.upper()
df['city']
0      BEIJING
1           SH
2    GUANGZHOU
3     SHENZHEN
4     SHANGHAI
5      BEIJING
Name: city, dtype: object
df['city'] = df['city'].map(str.lower)
df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

5. 更改数据格式

df['price'] = df['price'].astype('int')  # 转换为整数
df['price'] = df['price'].astype('float')  # 转换为浮点数

6. 更改列名

df.rename(columns={'old_name':'new_name'}, inplace=True)

inplace=True表示在原数据上进行修改,默认是False,即不修改原数据,而是返回一个新的对象

df.rename(columns = {"category" : "category-size"})
id date city category-size age price
0 1001 2013-01-02 beijing 100-A 23 1200.0
1 1002 2013-01-03 sh 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 beijing 130-F 32 4432.0

7. 删除后出现的重复值

df.drop_duplicates(inplace=True, keep='first')

keep参数表示保留哪一个重复值,first表示保留第一个,last表示保留最后一个,False表示删除所有重复值, 默认是first

df['city'].drop_duplicates()
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
Name: city, dtype: object

8. 数据替换

df['city'].replace('New York', 'NYC', inplace=True)
df['city'].replace('sh', 'shanghai')
0      beijing
1     shanghai
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
posted @ 2025-08-24 08:03  栗悟饭与龟功気波  阅读(8)  评论(0)    收藏  举报