python数据分析——处理丢失数据

处理丢失数据

有两种丢失数据:

  • None
  • np.nan(NaN)

1. None

None是Python自带的,其类型为python object。因此,None不能参与到任何计算中。

In [1]:
#查看None的数据类型
 

2. np.nan(NaN)

 

np.nan是浮点类型,能参与到计算中。但计算的结果总是NaN。

In [2]:
#查看np.nan的数据类型

3. pandas中的None与NaN

1) pandas中None与np.nan都视作np.nan

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

创建DataFrame

In [2]:
df = DataFrame(data=np.random.randint(10,50,size=(8,8)))
df
Out[2]:
 01234567
0 12 39 29 15 34 17 15 10
1 20 13 34 20 38 22 34 15
2 30 45 41 28 26 22 19 27
3 37 29 47 12 37 49 28 44
4 30 29 40 34 40 41 20 20
5 32 49 23 38 25 13 25 40
6 23 46 31 22 15 37 21 48
7 41 39 35 26 46 48 33 24
In [18]:
#将某些数组元素赋值为nan
In [4]:
df.iloc[1,3] = None
df.iloc[2,2] = None
df.iloc[4,2] = None
df.iloc[6,7] = np.nan
In [5]:
df
Out[5]:
 01234567
0 12 39 29.0 15.0 34 17 15 10.0
1 20 13 34.0 NaN 38 22 34 15.0
2 30 45 NaN 28.0 26 22 19 27.0
3 37 29 47.0 12.0 37 49 28 44.0
4 30 29 NaN 34.0 40 41 20 20.0
5 32 49 23.0 38.0 25 13 25 40.0
6 23 46 31.0 22.0 15 37 21 NaN
7 41 39 35.0 26.0 46 48 33 24.0
 

2) pandas处理空值操作

  • isnull()
  • notnull()
  • dropna(): 过滤丢失数据
  • fillna(): 填充丢失数据
In [8]:
df.isnull()
Out[8]:
 01234567
0 False False False False False False False False
1 False False False True False False False False
2 False False True False False False False False
3 False False False False False False False False
4 False False True False False False False False
5 False False False False False False False False
6 False False False False False False False True
7 False False False False False False False False
In [14]:
#创建DataFrame,给其中某些元素赋值为nan

df.notnull().all(axis=1)     #notnull(all)   isnull(any)
Out[14]:
0     True
1    False
2    False
3     True
4    False
5     True
6    False
7     True
dtype: bool
In [15]:
df.loc[df.notnull().all(axis=1)]
Out[15]:
 01234567
0 12 39 29.0 15.0 34 17 15 10.0
3 37 29 47.0 12.0 37 49 28 44.0
5 32 49 23.0 38.0 25 13 25 40.0
7 41 39 35.0 26.0 46 48 33 24.0
 

(1)判断函数

  • isnull()
  • notnull()
  • df.notnull/isnull().any()/all()
In [4]:
#过滤df中的空值(只保留没有空值的行)

df.dropna() 可以选择过滤的是行还是列(默认为行):axis中0表示行,1表示的列

In [17]:
df.dropna(axis=0)
Out[17]:
 01234567
0 12 39 29.0 15.0 34 17 15 10.0
3 37 29 47.0 12.0 37 49 28 44.0
5 32 49 23.0 38.0 25 13 25 40.0
7 41 39 35.0 26.0 46 48 33 24.0
 

(3) 填充函数 Series/DataFrame

  • fillna():value和method参数
In [18]:
df
Out[18]:
 01234567
0 12 39 29.0 15.0 34 17 15 10.0
1 20 13 34.0 NaN 38 22 34 15.0
2 30 45 NaN 28.0 26 22 19 27.0
3 37 29 47.0 12.0 37 49 28 44.0
4 30 29 NaN 34.0 40 41 20 20.0
5 32 49 23.0 38.0 25 13 25 40.0
6 23 46 31.0 22.0 15 37 21 NaN
7 41 39 35.0 26.0 46 48 33 24.0

可以选择前向填充还是后向填充

In [21]:
df.fillna(method='ffill',axis=1)
Out[21]:
 01234567
0 12.0 39.0 29.0 15.0 34.0 17.0 15.0 10.0
1 20.0 13.0 34.0 34.0 38.0 22.0 34.0 15.0
2 30.0 45.0 45.0 28.0 26.0 22.0 19.0 27.0
3 37.0 29.0 47.0 12.0 37.0 49.0 28.0 44.0
4 30.0 29.0 29.0 34.0 40.0 41.0 20.0 20.0
5 32.0 49.0 23.0 38.0 25.0 13.0 25.0 40.0
6 23.0 46.0 31.0 22.0 15.0 37.0 21.0 21.0
7 41.0 39.0 35.0 26.0 46.0 48.0 33.0 24.0

method 控制填充的方式 bfill ffill

============================================

练习7:

  1. 简述None与NaN的区别

  2. 假设张三李四参加模拟考试,但张三因为突然想明白人生放弃了英语考试,因此记为None,请据此创建一个DataFrame,命名为ddd3

  3. 老师决定根据用数学的分数填充张三的英语成绩,如何实现? 用李四的英语成绩填充张三的英语成绩?

============================================

In [24]:
data = pd.read_excel('测试数据.xlsx')
data.drop(labels=['none1','none2'],axis=1,inplace=True)
data
Out[24]:
 time1234567
0 2019-01-27 17:00:00 -24.8 -18.2 -20.8 -18.8 NaN NaN NaN
1 2019-01-27 17:01:00 -23.5 -18.8 -20.5 -19.8 -15.2 -14.5 -16.0
2 2019-01-27 17:02:00 -23.2 -19.2 NaN NaN -13.0 NaN -14.0
3 2019-01-27 17:03:00 -22.8 -19.2 -20.0 -20.5 NaN -12.2 -9.8
4 2019-01-27 17:04:00 -23.2 -18.5 -20.0 -18.8 -10.2 -10.8 -8.8
5 2019-01-27 17:05:00 NaN NaN -19.0 -18.2 -10.0 -10.5 -10.8
6 2019-01-27 17:06:00 NaN -18.5 -18.2 -17.5 NaN NaN NaN
7 2019-01-27 17:07:00 -24.8 -18.0 -17.5 -17.2 -14.2 -14.0 -12.5
8 2019-01-27 17:08:00 -25.2 -17.8 NaN NaN -16.2 NaN -14.5
9 2019-01-27 17:09:00 -24.8 -18.2 NaN -17.5 NaN -15.5 -16.0
10 2019-01-27 17:10:00 -24.5 -18.5 -16.0 -18.5 -17.5 -16.5 -17.2
11 2019-01-27 17:11:00 NaN NaN -16.0 -18.5 -17.8 -16.8 -12.0
12 2019-01-27 17:12:00 NaN -18.5 -15.8 -18.8 NaN NaN NaN
13 2019-01-27 17:13:00 -23.8 -18.5 NaN NaN 4.5 NaN 0.0
14 2019-01-27 17:14:00 -23.2 -18.2 NaN -19.0 NaN 5.8 6.8
15 2019-01-27 17:15:00 -23.5 -17.8 -15.0 -18.0 10.5 10.5 10.8
16 2019-01-27 17:16:00 NaN NaN -14.2 -17.2 14.0 13.5 13.0
17 2019-01-27 17:17:00 NaN -18.2 -13.8 -17.8 15.8 15.2 14.2
18 2019-01-27 17:18:00 -23.2 -19.0 -13.8 -18.2 NaN NaN NaN
19 2019-01-27 17:19:00 -23.2 -19.5 NaN NaN 17.8 NaN 15.2
20 2019-01-27 17:20:00 -23.2 -19.8 NaN -19.0 18.2 17.2 15.8
21 2019-01-27 17:21:00 -23.5 -20.0 -13.8 -19.5 NaN 17.8 16.0
22 2019-01-27 17:22:00 NaN NaN -14.0 -19.5 18.8 18.0 16.2
23 2019-01-27 17:23:00 -23.2 -20.2 -14.0 -19.5 19.0 18.2 16.5
24 2019-01-27 17:24:00 NaN -20.2 -14.2 -19.5 NaN NaN NaN
25 2019-01-27 17:25:00 -22.8 -20.5 -14.5 -19.5 19.2 NaN 16.5
26 2019-01-27 17:26:00 -22.8 -20.8 -15.0 -16.8 NaN 17.2 16.8
27 2019-01-27 17:27:00 -22.0 -16.0 NaN -16.0 18.8 17.2 16.2
28 2019-01-27 17:28:00 -22.8 -15.2 -14.8 -15.2 18.8 17.2 16.2
29 2019-01-27 17:29:00 -22.5 -15.0 -14.8 -15.2 18.8 17.2 16.5
... ... ... ... ... ... ... ... ...
1030 2019-01-28 10:10:00 -30.5 -27.5 -29.5 -27.8 -3.8 -3.5 -8.2
1031 2019-01-28 10:11:00 -30.8 -27.0 -29.2 -27.8 -3.8 -3.2 -8.5
1032 2019-01-28 10:12:00 -30.5 -26.2 -29.0 -26.8 -3.5 -3.0 -8.8
1033 2019-01-28 10:13:00 -28.8 -25.2 -28.2 -26.2 -3.5 -3.0 -8.8
1034 2019-01-28 10:14:00 -25.2 -25.2 -28.2 -25.8 -3.0 -2.5 -8.8
1035 2019-01-28 10:15:00 -25.2 -25.8 -28.5 -26.2 -3.0 -2.2 -8.5
1036 2019-01-28 10:16:00 -25.8 -26.2 -28.8 -26.8 -2.8 -2.0 -8.2
1037 2019-01-28 10:17:00 -26.2 -26.8 -29.0 -27.2 -2.5 -1.8 -8.2
1038 2019-01-28 10:18:00 -26.5 -27.0 -29.2 -27.5 NaN NaN NaN
1039 2019-01-28 10:19:00 -27.0 -27.2 -29.5 -28.0 -2.2 -1.5 -7.8
1040 2019-01-28 10:20:00 -26.5 -26.8 -29.0 -28.0 -2.2 -1.5 -7.5
1041 2019-01-28 10:21:00 -25.0 -25.8 -28.5 -27.2 -2.2 -1.5 -7.5
1042 2019-01-28 10:22:00 -24.0 -25.2 -28.2 -26.5 -2.0 -1.5 -7.2
1043 2019-01-28 10:23:00 -23.8 -25.0 -28.0 -26.0 -2.0 -1.5 -7.2
1044 2019-01-28 10:24:00 -24.0 -25.2 -28.0 -25.5 -2.0 -1.5 -7.0
1045 2019-01-28 10:25:00 -25.0 -26.0 -28.2 -26.2 -2.2 -1.5 -6.8
1046 2019-01-28 10:26:00 -25.8 -26.5 -28.8 -26.8 -2.2 -1.5 -6.5
1047 2019-01-28 10:27:00 -26.2 -26.5 -28.8 -27.2 -2.2 -1.5 -6.5
1048 2019-01-28 10:28:00 -25.0 -25.8 -28.5 -27.0 -2.2 -1.8 -6.2
1049 2019-01-28 10:29:00 -24.8 -25.2 -28.0 -26.2 -2.2 -1.8 -6.0
1050 2019-01-28 10:30:00 -24.5 -24.8 -27.8 -25.8 -2.0 -2.0 -6.0
1051 2019-01-28 10:31:00 -24.0 -24.8 -27.8 -25.5 -2.0 -2.0 -5.8
1052 2019-01-28 10:32:00 -24.2 -25.5 -28.0 -26.0 -2.0 -2.0 -5.5
1053 2019-01-28 10:33:00 -25.0 -26.2 -28.2 -26.8 -2.0 -2.0 -5.2
1054 2019-01-28 10:34:00 -25.8 -26.8 -28.5 -27.0 -2.0 -2.2 -5.2
1055 2019-01-28 10:35:00 -26.2 -27.2 -28.8 -27.5 -2.0 NaN -5.0
1056 2019-01-28 10:36:00 -26.8 -27.5 -29.0 -27.8 -2.2 NaN -5.0
1057 2019-01-28 10:37:00 -27.2 -27.8 -29.0 -28.0 -2.2 NaN -5.0
1058 2019-01-28 10:38:00 -27.5 -27.0 -29.0 -28.0 -3.5 -3.2 -5.8
1059 2019-01-28 10:39:00 -27.0 -27.2 -29.0 -27.8 -5.0 NaN -7.0

1060 rows × 8 columns

In [27]:
data.shape
Out[27]:
(1060, 8)
In [26]:
data.dropna(axis=0).shape
Out[26]:
(927, 8)
In [35]:
#检测哪些列中存在空值
data.isnull().any(axis=0)
Out[35]:
time    False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
dtype: bool
In [34]:
data.fillna(method='ffill',axis=0,inplace=True)
data.fillna(method='bfill',axis=0,inplace=True)
posted @ 2019-09-30 11:32  陪伴is最长情的告白  阅读(767)  评论(0编辑  收藏  举报