数据清洗之异常值处理

1.异常值的处理方法：
1）. 3δ原则：与平均值的偏差超过标准3个标准差
2）. 箱线图法：异常值>上四分位数+1.5IQR 或 异常值<下四分位数-1.5IQR， IQR=上四分位数-下四分位数
3）. 业务常识

# 标准差原则和箱线图法
import numpy as np
import pandas as pd

online_retail_pd = pd.read_csv(r'online_retail.csv', encoding='ISO-8859-1')
# 删除完全重复的行
online_retail_pd.drop_duplicates(inplace=True)
# 删除顾客id为空的行
online_retail_pd.dropna(subset=['CustomerID'], inplace=True)
price_mean = np.mean(online_retail_pd['UnitPrice'])
price_std = np.std(online_retail_pd['UnitPrice'])

# 标准差原则:
print(f'price mean: {price_mean}, price_std: {price_std}')
price_low_bound = price_mean - 3 * price_std
price_upper_bound = price_mean + 3 * price_std
print(f'price low bound: {price_low_bound}, price upper bound: {price_upper_bound}')
# price mean: 3.47406363979831, price_std: 69.76394820732074
# price low bound: -205.8177809821639, price upper bound: 212.76590826176053
# 均值比较小，标准差很大
# 通过3δ原则可知，下界：数据小于-205 ，上界：大于212就是异常值，结合实际业务可知，价格不可能是负数，
# 说明下界是没有参考意义的。

# 箱线图法
# 箱线图法:
# 上四分位数
price_qu = online_retail_pd['UnitPrice'].quantile(q=0.75)
# 下四分位数
price_qr = online_retail_pd['UnitPrice'].quantile(q=0.25)
print(price_qu, price_qr)
# 四分位数间距
price_iqr = price_qu - price_qr
price_max_bound = price_qu + 1.5 * price_iqr
price_min_bound = price_qr - 1.5 * price_iqr
print(f'price low bound: {price_min_bound}, price upper bound: {price_max_bound}')

# 把CustomerID转换为int类型，原本是float类型
online_retail_pd['CustomerID'] = online_retail_pd['CustomerID'].apply(int)

# 新增了三列，用map自定义函数，lambda函数返回长度
online_retail_pd['InvoiceNo_Len'] = online_retail_pd['InvoiceNo'].map(lambda x: len(x))
online_retail_pd['StockCode_Len'] = online_retail_pd['StockCode'].map(lambda x: len(x))
online_retail_pd['CustomerID_Len'] = online_retail_pd['CustomerID'].map(lambda x: len(str(x)))

# print(online_retail_pd.groupby('InvoiceNo_Len').size())
#
# print(online_retail_pd.groupby('StockCode_Len').size())
#
# print(online_retail_pd.groupby('CustomerID_Len').size())

# 3) Quantity、UnitPrice 异常, 取消的订单

print(online_retail_pd[online_retail_pd['Quantity'] <= 0])
# print(online_retail_pd[online_retail_pd['UnitPrice'] <= 0])
online_retail_pd['Is_Cancel'] = online_retail_pd['InvoiceNo'].apply(lambda x: 1 if x[0] == 'C' else 0)
print(online_retail_pd.groupby('Is_Cancel').size())

print(len(online_retail_pd))
online_retail_pd = online_retail_pd[(online_retail_pd['Is_Cancel'] == 0)
& (online_retail_pd['Quantity'] > 0)
& (online_retail_pd['UnitPrice'] > 0)]
print(len(online_retail_pd))

# 数据汇总
# 删除无用字段
online_retail_pd.drop(['InvoiceNo_Len', 'StockCode_Len', 'CustomerID_Len', 'Is_Cancel'], axis=1, inplace=True)

# # 新增销售金额
online_retail_pd['Sale_Amount'] = online_retail_pd['Quantity'] * online_retail_pd['UnitPrice']
online_retail_pd['InvoiceDate'] = pd.to_datetime(online_retail_pd['InvoiceDate'])

# 重置索引(为什么要重置索引，因为之前删掉了一些数据，导致你数据的索引是断开的，所以需要重置索引)
online_retail_pd.reset_index(drop=True)
print(online_retail_pd.info())


posted @ 2021-07-07 18:07  爱时尚疯了的朱  阅读(81)  评论(0编辑  收藏  举报