pandas Group Study: Task 7

1. Counting and Dropping Missing Values

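Counting

Use the isna function. It returns a boolean for every position, indicating whether the value there is missing.

To view the rows where a given column is missing, index with the return value of isna, for example: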
```
df[df.Height.isna()].head()
```
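Because isna returns booleans, summing or averaging the result gives per-column counts and ratios of missing values. Below is a minimal sketch on a hypothetical toy DataFrame (the values are made up; only the column names follow the text):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror those used in the text.
df = pd.DataFrame({
    "Height": [158.9, np.nan, 188.9, np.nan],
    "Weight": [46.0, 70.0, np.nan, np.nan],
    "Transfer": ["N", "N", np.nan, np.nan],
})

missing_count = df.isna().sum()   # number of missing values per column
missing_ratio = df.isna().mean()  # fraction of missing values per column
print(missing_count)
print(missing_ratio)
```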
To check several columns at once, combine isna with any and all.

For example, isna + all selects the rows where all three columns are missing:
```
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)]
```
isna + any: at least one of the columns is missing
```
df[sub_set.isna().any(axis=1)].head()
```
notna + all: none of the columns is missing
```
df[sub_set.notna().all(axis=1)].head()
```
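The three combinations above can be checked side by side on a small frame. This is a sketch on hypothetical data (the values are invented for illustration), passing the axis by keyword:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "Height": [158.9, np.nan, 188.9, np.nan],
    "Weight": [46.0, 70.0, np.nan, np.nan],
    "Transfer": ["N", "N", np.nan, np.nan],
})
mask = df[["Height", "Weight", "Transfer"]].isna()

all_missing = df[mask.all(axis=1)]    # every selected column is missing
any_missing = df[mask.any(axis=1)]    # at least one selected column is missing
none_missing = df[~mask.any(axis=1)]  # same rows as notna().all(axis=1)

print(all_missing.index.tolist())
print(any_missing.index.tolist())
print(none_missing.index.tolist())
```

Negating any gives the same rows as notna + all, which is often the more convenient form when the mask is already at hand.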
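Dropping

Use the dropna function:
- axis: defaults to 0, which drops rows; 1 drops columns
- how: the dropping rule, either any or all
- thresh: the minimum number of non-missing values; rows or columns with fewer are dropped

For example, drop the rows where at least one of Height and Weight is missing: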
```
res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
```
Following the counting section above, this can also be written as:
```
subset = ['Height', 'Weight']
res = df[df[subset].notna().all(axis=1)]  # keep rows with no missing value in subset
```
Drop the columns that have more than 15 missing values:
```
res = df.dropna(axis=1, thresh=df.shape[0]-15)
```
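The thresh arithmetic is easy to get backwards: thresh counts non-missing values, so "at most k missing" becomes thresh = number of rows - k. A small sketch on a hypothetical frame (the columns A, B, C and the limit of 2 are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A has 0 missing values, B has 2, C has 3.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [np.nan, np.nan, 3.0, 4.0],
    "C": [np.nan, np.nan, np.nan, 4.0],
})

max_missing = 2  # drop columns with more than this many missing values
kept_cols = df.dropna(axis=1, thresh=df.shape[0] - max_missing)

# Row-wise counterpart: drop rows where A or B is missing.
kept_rows = df.dropna(how="any", subset=["A", "B"])

print(kept_cols.columns.tolist())
print(kept_rows.index.tolist())
```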

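2. Filling and Interpolating Missing Values

Filling:

Use the fillna function:

- value: the fill value
- method: the fill method; ffill fills with the previous value, bfill with the next value
- limit: the maximum number of consecutive missing values to fill

Some concrete examples follow: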

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan],
                  list('aaabcd'))
s
Out[336]: 
a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

Fill with the previous value:

s.fillna(method='ffill')
Out[337]: 
a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

For consecutive missing values, fill at most one:

s.fillna(method='ffill', limit=1)
Out[338]: 
a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64

Fill through an index-to-value mapping:

s.fillna({'a': 100, 'd': 200})
Out[339]: 
a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64
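A note for newer pandas versions: since pandas 2.1, passing method to fillna raises a deprecation warning, and the ffill and bfill methods are the suggested replacement. The sketch below repeats the forward-fill examples above with those methods, on the same series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list("aaabcd"))

forward = s.ffill()                 # same result as fillna(method='ffill')
forward_limited = s.ffill(limit=1)  # fill at most one value per run of NaNs
backward = s.bfill()                # fill with the next valid value instead

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

The leading NaN survives a forward fill because there is no earlier value to copy, and the trailing NaN survives a backward fill for the mirrored reason.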

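Exercise:

Fill the missing values of a series by the following rule: an isolated missing value is filled with the mean of its two neighbors, while consecutive missing values are left unfilled. That is, the series [1, NaN, 3, NaN, NaN] becomes [1, 2, 3, NaN, NaN]. Implement this with the fillna function. (Hint: use the limit parameter.)

Solution:

Create a series: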

s = pd.Series([1,np.nan, 3,np.nan,np.nan])
s
Out[342]: 
0    1.0
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64

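Forward-fill with limit=1, backward-fill with limit=1, then average the two results. Within a run of consecutive missing values, each position is reached by at most one of the two fills, so its average stays NaN: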

a = s.fillna(method = 'ffill',limit=1)
b = s.fillna(method = 'bfill',limit=1)
s = (a+b)/2
s
0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64
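The same idea can be wrapped in a reusable function. This is a sketch, not the only possible solution: the function name is hypothetical, and it uses the ffill and bfill methods rather than fillna(method=...) so that it also runs without warnings on recent pandas versions.

```python
import numpy as np
import pandas as pd


def fill_isolated_with_neighbor_mean(s: pd.Series) -> pd.Series:
    """Fill isolated NaNs with the mean of their neighbors; leave runs of NaNs alone."""
    forward = s.ffill(limit=1)   # each run of NaNs gets only its first position filled
    backward = s.bfill(limit=1)  # each run of NaNs gets only its last position filled
    # Only an isolated NaN is filled by both, so only it has a non-NaN average.
    return (forward + backward) / 2


example = fill_isolated_with_neighbor_mean(pd.Series([1, np.nan, 3, np.nan, np.nan]))
run_of_two = fill_isolated_with_neighbor_mean(pd.Series([1, np.nan, np.nan, 4]))
print(example.tolist())
print(run_of_two.tolist())
```

An isolated NaN at the very start or end of the series also stays NaN, since it has a neighbor on one side only.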

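Interpolation

Use the interpolate function, which performs linear interpolation by default:

- limit_direction: controls the direction of interpolation; the default is forward
- limit: the maximum number of consecutive missing values to interpolate

Create a series: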

s = pd.Series([np.nan, np.nan, 1,
               np.nan, np.nan, np.nan,
               2, np.nan, np.nan])
s
Out[361]: 
0    NaN
1    NaN
2    1.0
3    NaN
4    NaN
5    NaN
6    2.0
7    NaN
8    NaN
dtype: float64

Linear interpolation in the backward direction, filling at most one value per run of missing values:

res = s.interpolate(limit_direction='backward', limit=1)
res
Out[358]: 
0     NaN
1    1.00
2    1.00
3     NaN
4     NaN
5    1.75
6    2.00
7     NaN
8     NaN
dtype: float64
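To see how limit_direction interacts with limit, it helps to compare directions on the gap between the two known values (positions 3 to 5). A sketch on the same series; only the interior gap is inspected here, since the leading and trailing NaNs follow separate edge rules:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1,
               np.nan, np.nan, np.nan,
               2, np.nan, np.nan])

forward_one = s.interpolate(limit=1)                       # default direction: forward
both_one = s.interpolate(limit_direction="both", limit=1)  # fill from both sides

# Positions 3..5 lie between the known values 1 and 2.
print(forward_one.iloc[3:6].tolist())
print(both_one.iloc[3:6].tolist())
```

With the forward direction only position 3 (next to the known value 1) is filled; with both, position 5 (next to the known value 2) is filled as well, and position 4 stays missing in either case.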

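Nearest-neighbor interpolation, which uses the non-missing element closest to the missing value (this method requires SciPy):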

res = s.interpolate('nearest')
res
Out[363]: 
0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
5    2.0
6    2.0
7    NaN
8    NaN
dtype: float64

Index interpolation, which interpolates linearly according to the index values:

s = pd.Series([0,np.nan,10],index=[0,1,10])
s
Out[365]: 
0      0.0
1      NaN
10    10.0
dtype: float64

s.interpolate(method='index')
Out[36]: 
0      0.0
1      1.0
10    10.0
dtype: float64
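The difference from the default method shows up when the index is unevenly spaced. A short sketch comparing the two on the same series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 10], index=[0, 1, 10])

by_position = s.interpolate()             # 'linear': treats the values as equally spaced
by_index = s.interpolate(method="index")  # weights by the actual index values

print(by_position.iloc[1])
print(by_index.iloc[1])
```

The default linear method ignores the index and returns the midpoint 5.0, while method='index' places the missing point one tenth of the way from 0 to 10 and returns 1.0.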

posted @ 2021-01-02 23:08 爱睡觉的皮卡丘
