Python Data Analysis
 
 
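- 🌸Homepage: JoJo的数据分析历险记 (JoJo's Data Analysis Adventures)
- 📝About the author: a fourth-year statistics undergraduate, recommended for admission to a graduate statistics program at a top-3 university for statistics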
This column covers applications of Python in data analysis.
 Reference:
 Python for Data Analysis (利用python数据分析)
 
 
 
 
 
Earlier we covered NumPy for data processing. This article covers pandas, which is built on top of NumPy but makes data processing considerably more convenient.
 
Import the required libraries
 
import numpy as np
import pandas as pd
 
💮1. Series objects

pandas has two main data objects: Series, which resembles a one-dimensional vector, and DataFrame, which holds tabular data. Let's first see how to create a Series.
 
s = pd.Series([12,-4,7,9])
s
 
0    12
1    -4
2     7
3     9
dtype: int64
 
🏵️1.1 Basic Series operations
 
s[2]
 
7
 
s[2]=5
s
s['a'] = 4
s
 
0    12
1    -4
2     5
3     9
a     4
dtype: int64
 
arr = np.array([1,2,3,4])
s2 = pd.Series(arr)
s2
arr[1] = 9
s2
 
0    1
1    9
2    3
3    4
dtype: int32
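
In the output above the Series reflects a change made to arr after the Series was created, which means the two shared the same memory in the environment used here. Whether a Series shares memory with its source array can differ between pandas versions, so the sketch below copies the array explicitly when an independent Series is wanted. The name s_independent is made up for this illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# Copy the array explicitly so the Series owns its own data.
s_independent = pd.Series(arr.copy())

arr[1] = 9            # modify the original array afterwards
print(s_independent)  # the Series still holds 1, 2, 3, 4
```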
 
s[s>8]
 
0    12
3     9
dtype: int64
 
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd
 
white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64
 
serd.unique()
 
array([1, 0, 2, 3], dtype=int64)
 
serd.value_counts()
 
2    2
1    2
3    1
0    1
dtype: int64
 
serd.isin([0,3])
 
white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool
 
serd[serd.isin([0,3])]
 
white     0
yellow    3
dtype: int64
 
s2 = pd.Series([-5,3,np.nan,14])
s2
 
0    -5.0
1     3.0
2     NaN
3    14.0
dtype: float64
 
s2.isnull()
s2.notnull()
 
0     True
1     True
2    False
3     True
dtype: bool
 
s2
 
0    -5.0
1     3.0
2     NaN
3    14.0
dtype: float64
 
mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydict)
myseries
 
red       2000
blue      1000
yellow     500
orange    1000
dtype: int64
 
When a label has no corresponding value, pandas fills it with NaN.
 
colors = ['red','blue','yellow','orange','green']
myseries = pd.Series(mydict, index = colors)
myseries
 
red       2000.0
blue      1000.0
yellow     500.0
orange    1000.0
green        NaN
dtype: float64
 
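Any arithmetic involving NaN yields NaN. Labels that appear in only one of the two Series also yield NaN, because the Series are aligned on their index before the addition, so calling fillna(0) on each operand first does not prevent it.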
 
mydict2 ={'red':400,'yellow':1000,"black":700}
myseries2 = pd.Series(mydict2)
myseries.fillna(0) + myseries2.fillna(0)
 
black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
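
If the goal is to treat a label missing from one side as 0, one option is the add() method with its fill_value parameter. The following is a minimal sketch that rebuilds two small Series with the same values as above (without the extra green label); the name total is chosen for the illustration.

```python
import pandas as pd

myseries = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# A label that is missing from one side is treated as 0 instead of producing NaN.
total = myseries.add(myseries2, fill_value=0)
print(total)
```

Note that a label whose value is missing on both sides still ends up as NaN.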
 
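🌹2. DataFrame objects

The DataFrame is the data format we meet most often in data analysis. It is like a matrix made of rows and columns, where each column usually represents a variable and each row an observation. Let's look at some basic DataFrame usage first.

Creating a DataFrame object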
 
data = {'color':['blue','green','yellow','red','white'],
        'object':['ball','pen','pencil','paper','mug'],
        'price':[1.2,1.0,0.6,0.9,1.7]}
frame = pd.DataFrame(data)
frame
 
|  | color | object | price | 
|---|
| 0 | blue | ball | 1.2 | 
|---|
| 1 | green | pen | 1.0 | 
|---|
| 2 | yellow | pencil | 0.6 | 
|---|
| 3 | red | paper | 0.9 | 
|---|
| 4 | white | mug | 1.7 | 
|---|
 
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2
 
|  | object | price | 
|---|
| 0 | ball | 1.2 | 
|---|
| 1 | pen | 1.0 | 
|---|
| 2 | pencil | 0.6 | 
|---|
| 3 | paper | 0.9 | 
|---|
| 4 | mug | 1.7 | 
|---|
 
frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])
frame3
 
 
 
|  | color | object | price | 
|---|
| one | blue | ball | 1.2 | 
|---|
| two | green | pen | 1.0 | 
|---|
| three | yellow | pencil | 0.6 | 
|---|
| four | red | paper | 0.9 | 
|---|
| five | white | mug | 1.7 | 
|---|
 
frame.columns
 
Index(['color', 'object', 'price'], dtype='object')
 
frame.index
 
RangeIndex(start=0, stop=5, step=1)
 
frame.values
 
array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)
 
frame['price']
 
0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64
 
frame.iloc[2]
 
 
color     yellow
object    pencil
price        0.6
Name: 2, dtype: object
 
frame.iloc[[2,4]]
 
|  | color | object | price | 
|---|
| 2 | yellow | pencil | 0.6 | 
|---|
| 4 | white | mug | 1.7 | 
|---|
 
frame[0:4]
 
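When selecting rows of a DataFrame by slicing, frame[0:1] returns the first row and frame[1:2] returns the second row; frame[0:4] returns the first four rows.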
 
|  | color | object | price | 
|---|
| 0 | blue | ball | 1.2 | 
|---|
| 1 | green | pen | 1.0 | 
|---|
| 2 | yellow | pencil | 0.6 | 
|---|
| 3 | red | paper | 0.9 | 
|---|
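
Row selection can also be written with the explicit indexers iloc (by position) and loc (by label), or with a boolean mask. The sketch below rebuilds the same frame; the variable names first_row, by_label and cheap are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                      'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                      'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

first_row = frame.iloc[0:1]          # position-based slice, end excluded: 1 row
by_label = frame.loc[0:1]            # label-based slice, end included: 2 rows
cheap = frame[frame['price'] < 1.0]  # boolean mask keeps rows where the condition holds
print(cheap)
```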
 
frame['object'][3]
 
'paper'
 
frame['new']=12 
frame
 
 
 
|  | color | object | price | new | 
|---|
| 0 | blue | ball | 1.2 | 12 | 
|---|
| 1 | green | pen | 1.0 | 12 | 
|---|
| 2 | yellow | pencil | 0.6 | 12 | 
|---|
| 3 | red | paper | 0.9 | 12 | 
|---|
| 4 | white | mug | 1.7 | 12 | 
|---|
 
frame['new']=[1,2,3,4,5]
frame
 
 
 
|  | color | object | price | new | 
|---|
| 0 | blue | ball | 1.2 | 1 | 
|---|
| 1 | green | pen | 1.0 | 2 | 
|---|
| 2 | yellow | pencil | 0.6 | 3 | 
|---|
| 3 | red | paper | 0.9 | 4 | 
|---|
| 4 | white | mug | 1.7 | 5 | 
|---|
 
frame.loc[2, 'price'] = 3.3  # .loc assigns in one step; chained frame['price'][2] = ... may not update frame
frame
 
 
 
|  | color | object | price | new | 
|---|
| 0 | blue | ball | 1.2 | 1 | 
|---|
| 1 | green | pen | 1.0 | 2 | 
|---|
| 2 | yellow | pencil | 3.3 | 3 | 
|---|
| 3 | red | paper | 0.9 | 4 | 
|---|
| 4 | white | mug | 1.7 | 5 | 
|---|
 
frame['new'] = 12
frame
del frame['new']
frame
 
 
 
|  | color | object | price | 
|---|
| 0 | blue | ball | 1.2 | 
|---|
| 1 | green | pen | 1.0 | 
|---|
| 2 | yellow | pencil | 3.3 | 
|---|
| 3 | red | paper | 0.9 | 
|---|
| 4 | white | mug | 1.7 | 
|---|
 
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'],
                      columns=['ball','pen','pencil','paper'])
frame3
frame3[frame3>12]
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | NaN | NaN | NaN | NaN | 
|---|
| white | NaN | NaN | NaN | NaN | 
|---|
| blue | NaN | NaN | NaN | NaN | 
|---|
| green | NaN | 13.0 | 14.0 | 15.0 | 
|---|
 
nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}
nestdict
 
{'red': {2012: 22, 2013: 33},
 'white': {2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 48}}
 
frame2 = pd.DataFrame(nestdict)
frame2
 
 
 
|  | red | white | blue | 
|---|
| 2011 | NaN | 13 | 17 | 
|---|
| 2012 | 22.0 | 22 | 27 | 
|---|
| 2013 | 33.0 | 16 | 48 | 
|---|
 
Transpose:
 
frame2.T
 
 
 
|  | 2011 | 2012 | 2013 | 
|---|
| red | NaN | 22.0 | 33.0 | 
|---|
| white | 13.0 | 22.0 | 16.0 | 
|---|
| blue | 17.0 | 27.0 | 48.0 | 
|---|
 
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index
 
Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')
 
ser.idxmax()
 
'white'
 
ser.idxmin()
 
'blue'
 
serd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])
serd
 
white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64
 
serd['white']
 
white    0
white    1
dtype: int64
 
 
ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])
ser
 
one      2
two      5
three    7
four     4
dtype: int64
 
ser.reindex(['three','one','five','two'])
 
three    7.0
one      2.0
five     NaN
two      5.0
dtype: float64
 
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3
 
0    1
3    5
5    6
6    3
dtype: int64
 
ser3.reindex(range(6),method='ffill')
 
0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64
 
ser3.reindex(range(8),method='bfill')
 
0    1.0
1    5.0
2    5.0
3    5.0
4    6.0
5    6.0
6    3.0
7    NaN
dtype: float64
 
frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])
 
 
 
|  | colors | price | new | object | 
|---|
| 0 | blue | 1.2 | blue | ball | 
|---|
| 1 | green | 1.0 | green | pen | 
|---|
| 2 | yellow | 3.3 | yellow | pencil | 
|---|
| 3 | red | 0.9 | red | paper | 
|---|
| 4 | white | 1.7 | white | mug | 
|---|
 
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser
 
red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64
 
ser.drop('yellow')
 
red      0.0
blue     1.0
white    3.0
dtype: float64
 
ser.drop(['blue','white'])
 
red       0.0
yellow    2.0
dtype: float64
 
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
frame.drop(['pen'],axis=1)
 
 
 
|  | ball | pencil | paper | 
|---|
| red | 0 | 2 | 3 | 
|---|
| blue | 4 | 6 | 7 | 
|---|
| yellow | 8 | 10 | 11 | 
|---|
| white | 12 | 14 | 15 | 
|---|
 
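🥀3. Basic pandas operations

🌺3.1 Arithmetic

- When two Series or DataFrame objects are combined, values whose label exists in both objects are added together
- A label that exists in only one of the objects gets NaN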
s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])
s1
 
white     3
yellow    2
green     5
blue      1
dtype: int64
 
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1 + s2
 
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64
 
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     columns=['ball','pen','pencil','paper'],
                      index = ['red','blue','yellow','white'])
frame1
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                     index = ['blue','yellow','green','white']
                     ,columns=['ball','pen','mug'])
frame2
 
 
 
|  | ball | pen | mug | 
|---|
| blue | 0 | 1 | 2 | 
|---|
| yellow | 3 | 4 | 5 | 
|---|
| green | 6 | 7 | 8 | 
|---|
| white | 9 | 10 | 11 | 
|---|
 
frame3 = frame1+frame2
frame3
 
 
 
|  | ball | mug | paper | pen | pencil | 
|---|
| blue | 4.0 | NaN | NaN | 6.0 | NaN | 
|---|
| green | NaN | NaN | NaN | NaN | NaN | 
|---|
| red | NaN | NaN | NaN | NaN | NaN | 
|---|
| white | 21.0 | NaN | NaN | 23.0 | NaN | 
|---|
| yellow | 11.0 | NaN | NaN | 13.0 | NaN | 
|---|
 
🌻3.2 Basic arithmetic methods

The main arithmetic methods are:

- add(): frame1.add(frame2) is equivalent to frame1 + frame2
- sub()
- div()
- mul()
The examples below illustrate them.
 
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                     columns=['ball','pen','pencil','paper'],
                      index = ['red','blue','yellow','white'])
frame
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])
ser 
 
ball      0
pen       1
pencil    2
paper     3
dtype: int32
 
frame-ser
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 0 | 0 | 0 | 
|---|
| blue | 4 | 4 | 4 | 4 | 
|---|
| yellow | 8 | 8 | 8 | 8 | 
|---|
| white | 12 | 12 | 12 | 12 | 
|---|
 
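When an index label exists in only one of the two data structures, the result contains a new entry for that label, with the value NaN.

For example, let's add an element mug to ser: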
 
ser['mug'] = 9
ser
 
ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64
 
frame - ser
 
 
 
|  | ball | mug | paper | pen | pencil | 
|---|
| red | 0 | NaN | 0 | 0 | 0 | 
|---|
| blue | 4 | NaN | 4 | 4 | 4 | 
|---|
| yellow | 8 | NaN | 8 | 8 | 8 | 
|---|
| white | 12 | NaN | 12 | 12 | 12 | 
|---|
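
The method forms are useful when the operator form is not flexible enough. The sketch below, under the assumption that we want to match a Series against the row index rather than the columns, uses sub() with axis=0, and add() with fill_value to avoid NaN for labels found on one side only. The small frame named other is made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Subtract the 'ball' column from every column: axis=0 matches on the row index.
diff = frame.sub(frame['ball'], axis=0)

# A hypothetical second frame that only covers one cell of frame.
other = pd.DataFrame({'ball': [100]}, index=['red'])

# Cells missing from one side are treated as 0 instead of producing NaN.
filled = frame.add(other, fill_value=0)
print(diff)
print(filled)
```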
 
🌼3.3 Function mapping

Functions can be applied to all the elements of a DataFrame or Series.
 
frame
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
np.sqrt(frame)
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0.000000 | 1.000000 | 1.414214 | 1.732051 | 
|---|
| blue | 2.000000 | 2.236068 | 2.449490 | 2.645751 | 
|---|
| yellow | 2.828427 | 3.000000 | 3.162278 | 3.316625 | 
|---|
| white | 3.464102 | 3.605551 | 3.741657 | 3.872983 | 
|---|
 
f = lambda x:x.max()-x.min()
def f(x):
    return x.max()-x.min()
 
frame.apply(f)
 
ball      12
pen       12
pencil    12
paper     12
dtype: int64
 
def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])
 
frame.apply(f,axis = 1)
 
 
 
|  | min | max | 
|---|
| red | 0 | 3 | 
|---|
| blue | 4 | 7 | 
|---|
| yellow | 8 | 11 | 
|---|
| white | 12 | 15 | 
|---|
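
To make the role of axis in apply() explicit, here is a small sketch on the same frame. The helper value_range and the result names are hypothetical; Series.map is shown as one way to transform a single column element by element.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

def value_range(x):
    """Spread between the largest and smallest value of a Series."""
    return x.max() - x.min()

col_range = frame.apply(value_range)          # axis=0 (default): one result per column
row_range = frame.apply(value_range, axis=1)  # axis=1: one result per row
doubled = frame['ball'].map(lambda v: v * 2)  # Series.map works element by element
print(col_range, row_range, doubled, sep='\n')
```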
 
🌷4. Statistical functions

- Most statistical functions for arrays also work on DataFrame objects, so they can be used directly
frame.sum()
 
ball      24
pen       28
pencil    32
paper     36
dtype: int64
 
frame.mean()
 
ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64
 
frame.describe()
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 
|---|
| mean | 6.000000 | 7.000000 | 8.000000 | 9.000000 | 
|---|
| std | 5.163978 | 5.163978 | 5.163978 | 5.163978 | 
|---|
| min | 0.000000 | 1.000000 | 2.000000 | 3.000000 | 
|---|
| 25% | 3.000000 | 4.000000 | 5.000000 | 6.000000 | 
|---|
| 50% | 6.000000 | 7.000000 | 8.000000 | 9.000000 | 
|---|
| 75% | 9.000000 | 10.000000 | 11.000000 | 12.000000 | 
|---|
| max | 12.000000 | 13.000000 | 14.000000 | 15.000000 | 
|---|

ser.rank(method='first')

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64
 
🌱4.1 Correlation and covariance

🌲4.1.1 Series objects

- These statistics usually involve two data objects
- The functions are corr() and cov(), respectively
seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq
 
2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64
 
seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2
 
2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64
 
seq.corr(seq2)
 
0.7745966692414834
 
seq.cov(seq2)
 
0.8571428571428571
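
As a cross-check, the same two numbers can be reproduced by hand from the definitions. This is a sketch assuming the default settings, where cov() and std() both use the sample normalisation (dividing by n - 1) and corr() computes the Pearson coefficient; cov_manual and corr_manual are names chosen for the example.

```python
import numpy as np
import pandas as pd

years = [str(y) for y in range(2006, 2014)]
seq = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], index=years)
seq2 = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], index=years)

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_manual = ((seq - seq.mean()) * (seq2 - seq2.mean())).sum() / (len(seq) - 1)

# Pearson correlation: covariance divided by the product of the standard deviations.
corr_manual = cov_manual / (seq.std() * seq2.std())

print(cov_manual, corr_manual)
```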
 
🌳4.1.2 DataFrame objects

For a DataFrame, corr() and cov() each return another DataFrame.
 
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame2
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 1 | 4 | 3 | 6 | 
|---|
| blue | 4 | 5 | 6 | 1 | 
|---|
| yellow | 3 | 3 | 1 | 5 | 
|---|
| white | 4 | 1 | 6 | 4 | 
|---|
 
frame2.corr()
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| ball | 1.000000 | -0.276026 | 0.577350 | -0.763763 | 
|---|
| pen | -0.276026 | 1.000000 | -0.079682 | -0.361403 | 
|---|
| pencil | 0.577350 | -0.079682 | 1.000000 | -0.692935 | 
|---|
| paper | -0.763763 | -0.361403 | -0.692935 | 1.000000 | 
|---|
 
frame2.cov()
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| ball | 2.000000 | -0.666667 | 2.000000 | -2.333333 | 
|---|
| pen | -0.666667 | 2.916667 | -0.333333 | -1.333333 | 
|---|
| pencil | 2.000000 | -0.333333 | 6.000000 | -3.666667 | 
|---|
| paper | -2.333333 | -1.333333 | -3.666667 | 4.666667 | 
|---|
 
🌴4.1.3 Correlation between a DataFrame and a Series

corrwith() computes the pairwise correlation between the columns (or rows) of a DataFrame and a Series or another DataFrame.
 
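ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])  # redefine ser, which was overwritten earlier
ser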
 
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64
 
frame2.corrwith(ser)
 
ball     -0.140028
pen      -0.869657
pencil    0.080845
paper     0.595854
dtype: float64
 
frame2.corrwith(frame)
 
ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64
 
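🌵4.2 Sorting and ranking

- For a Series, use sort_values() and rank(); the default order is ascending, and ascending=False switches to descending (the same applies below)
- For a DataFrame, use sort_values(by='...') and rank()
Sort ser: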
 
ser.sort_values()
 
blue      0
yellow    3
green     4
red       5
white     8
dtype: int64
 
Rank ser:
 
ser.rank()
 
red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64
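
The values in ser are all distinct, so the way rank() handles ties does not show up above. The following sketch uses a small made-up Series (scores) with a tie to compare the default method with method='first' and with descending ranks.

```python
import pandas as pd

scores = pd.Series([3, 1, 3, 2], index=['a', 'b', 'c', 'd'])

avg = scores.rank()                   # default 'average': tied values share the mean of their ranks
first = scores.rank(method='first')   # ties are ranked in order of appearance
desc = scores.rank(ascending=False)   # the largest value gets rank 1

print(pd.DataFrame({'value': scores, 'average': avg, 'first': first, 'descending': desc}))
```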
 
Sort frame by the pen column:
 
frame.sort_values(by='pen')
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser
 
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64
 
ser.sort_index()
 
blue      0
green     4
red       5
white     8
yellow    3
dtype: int64
 
ser.sort_index(ascending=False)
 
yellow    3
white     8
red       5
green     4
blue      0
dtype: int64
 
ser.sort_values()
 
blue      0
yellow    3
green     4
red       5
white     8
dtype: int64
 
frame
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
frame.sort_index()
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
 
axis selects which labels to sort by: axis=0 (the default) sorts by the row index, axis=1 sorts by the column names.
 
frame.sort_index(axis=1)
 
 
 
|  | ball | paper | pen | pencil | 
|---|
| red | 0 | 3 | 1 | 2 | 
|---|
| blue | 4 | 7 | 5 | 6 | 
|---|
| yellow | 8 | 11 | 9 | 10 | 
|---|
| white | 12 | 15 | 13 | 14 | 
|---|
 
🌾5. Handling missing values in pandas

🌿5.1 Creating NaN data


ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser
 
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64
 
ser['white']
 
nan
 
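☘️5.2 Removing NaN

- dropna()
- ser[ser.notnull()]
- On a DataFrame, dropna() by default (how='any') drops every row that contains at least one NaN. To avoid losing whole rows or columns, pass how='all' so that only rows (or columns, with axis=1) whose elements are all NaN are dropped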
frame3.dropna(how='all')
 
 
 
|  | ball | mug | paper | pen | pencil | 
|---|
| blue | 4.0 | NaN | NaN | 6.0 | NaN | 
|---|
| white | 21.0 | NaN | NaN | 23.0 | NaN | 
|---|
| yellow | 11.0 | NaN | NaN | 13.0 | NaN | 
|---|
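
The difference between how='all', how='any' and the axis argument is easier to see on a tiny frame. This is a sketch with made-up data (df) rather than the frame3 used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 5.0, np.nan],
                   'c': [np.nan, np.nan, np.nan]},
                  index=['x', 'y', 'z'])

rows_all = df.dropna(how='all')          # drops only row z, where every value is NaN
rows_any = df.dropna(how='any')          # every row contains a NaN, so nothing is left
cols_all = df.dropna(axis=1, how='all')  # drops only column c

print(rows_all, rows_any, cols_all, sep='\n\n')
```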
 
🍀5.3 Filling NaN with other values
 
 
frame3.fillna(0)
 
 
 
|  | ball | mug | paper | pen | pencil | 
|---|
| blue | 4.0 | 0.0 | 0.0 | 6.0 | 0.0 | 
|---|
| green | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
|---|
| red | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
|---|
| white | 21.0 | 0.0 | 0.0 | 23.0 | 0.0 | 
|---|
| yellow | 11.0 | 0.0 | 0.0 | 13.0 | 0.0 | 
|---|
 
# To replace NaN with a different value in each column, pass a dict mapping column names to the replacement values
 
frame3.fillna({'ball':1,"pen":99})
 
 
 
|  | ball | mug | paper | pen | pencil | 
|---|
| blue | 4.0 | NaN | NaN | 6.0 | NaN | 
|---|
| green | 1.0 | NaN | NaN | 99.0 | NaN | 
|---|
| red | 1.0 | NaN | NaN | 99.0 | NaN | 
|---|
| white | 21.0 | NaN | NaN | 23.0 | NaN | 
|---|
| yellow | 11.0 | NaN | NaN | 13.0 | NaN | 
|---|
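
Besides constants, two other common choices are filling with each column's mean or carrying the previous row forward. The sketch below uses a small hypothetical frame (df); which strategy is appropriate depends on the data and is not something the examples above prescribe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ball': [4.0, np.nan, 21.0],
                   'pen': [6.0, np.nan, 24.0]},
                  index=['blue', 'green', 'white'])

by_mean = df.fillna(df.mean())  # df.mean() is a Series keyed by column name
by_prev = df.ffill()            # forward fill: copy the last valid value downwards

print(by_mean)
print(by_prev)
```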
 
🍁6. Hierarchical indexing and statistics by level

Sometimes the data needs an index with several levels, as in the following example.
 
mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],
                                         ['up','down','right','up','down','up','down','left']])
mser
 
white  up       0.323513
       down     0.080292
       right    0.503630
blue   up       0.201143
       down     0.173879
red    up       0.866267
       down     0.601906
       left     0.140885
dtype: float64
 
mser.index
 
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
 
mser['white']
 
up       0.323513
down     0.080292
right    0.503630
dtype: float64
 
mser[:,'up']
 
white    0.323513
blue     0.201143
red      0.866267
dtype: float64
 
mser[:,'right']
 
white    0.50363
dtype: float64
 
mser['white','up']
 
0.32351250980575463
 
🍂6.1 The unstack() and stack() functions

unstack() converts a Series with a hierarchical index into a plain DataFrame, turning the second index level into the columns; stack() does the reverse:
 
mser.unstack() 
mser.unstack().fillna(0)
 
 
 
|  | down | left | right | up | 
|---|
| blue | 0.173879 | 0.000000 | 0.00000 | 0.201143 | 
|---|
| red | 0.601906 | 0.140885 | 0.00000 | 0.866267 | 
|---|
| white | 0.080292 | 0.000000 | 0.50363 | 0.323513 | 
|---|
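
A round trip shows that the two functions are inverses of each other when no values are missing. The sketch uses fixed integers instead of random numbers so the result is reproducible; the names wide and back are illustrative.

```python
import pandas as pd

mser = pd.Series([1, 2, 3, 4],
                 index=[['white', 'white', 'blue', 'blue'],
                        ['up', 'down', 'up', 'down']])

wide = mser.unstack()  # inner level ('up'/'down') becomes the columns
back = wide.stack()    # columns are folded back into an inner index level

print(wide)
print(back)
```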
 
frame
 
 
 
|  | ball | pen | pencil | paper | 
|---|
| red | 0 | 1 | 2 | 3 | 
|---|
| blue | 4 | 5 | 6 | 7 | 
|---|
| yellow | 8 | 9 | 10 | 11 | 
|---|
| white | 12 | 13 | 14 | 15 | 
|---|
 
frame.stack()
 
red     ball       0
        pen        1
        pencil     2
        paper      3
blue    ball       4
        pen        5
        pencil     6
        paper      7
yellow  ball       8
        pen        9
        pencil    10
        paper     11
white   ball      12
        pen       13
        pencil    14
        paper     15
dtype: int32
 
The rows and columns of a DataFrame can also have hierarchical indexes.
 
mframe = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index = [['white','white','red','red'],['up','down','up','down']],
                     columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe
 
 
 
|  |  | pen | pen | paper | paper |
|---|---|---|---|---|---|
|  |  | 1 | 2 | 1 | 2 |
| white | up | 0 | 1 | 2 | 3 |
|  | down | 4 | 5 | 6 | 7 |
| red | up | 8 | 9 | 10 | 11 |
|  | down | 12 | 13 | 14 | 15 |
 
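🍃6.2 Reordering levels

- swaplevel() takes the names of the two levels to swap and returns a new object with those levels exchanged; the order of the elements stays unchanged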
mframe.columns.names = ['object','id']
mframe.index.names = ['colors','status']
mframe
 
 
 
|  | object | pen | pen | paper | paper |
|---|---|---|---|---|---|
|  | id | 1 | 2 | 1 | 2 |
| colors | status |  |  |  |  |
| white | up | 0 | 1 | 2 | 3 |
|  | down | 4 | 5 | 6 | 7 |
| red | up | 8 | 9 | 10 | 11 |
|  | down | 12 | 13 | 14 | 15 |
 
mframe.swaplevel('colors','status')
 
 
 
|  | object | pen | pen | paper | paper |
|---|---|---|---|---|---|
|  | id | 1 | 2 | 1 | 2 |
| status | colors |  |  |  |  |
| up | white | 0 | 1 | 2 | 3 |
| down | white | 4 | 5 | 6 | 7 |
| up | red | 8 | 9 | 10 | 11 |
| down | red | 12 | 13 | 14 | 15 |
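
Because swaplevel() leaves the rows in their original order, the new outer level is not grouped. A common follow-up, sketched here on a smaller made-up frame with single-level columns, is to call sort_index() on the swapped result.

```python
import numpy as np
import pandas as pd

mframe = pd.DataFrame(np.arange(8).reshape((4, 2)),
                      index=[['white', 'white', 'red', 'red'],
                             ['up', 'down', 'up', 'down']],
                      columns=['pen', 'paper'])
mframe.index.names = ['colors', 'status']

# Swap the two index levels, then sort so that equal 'status' values sit together.
swapped = mframe.swaplevel('colors', 'status').sort_index(level='status')
print(swapped)
```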
 
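🌍6.3 Summary statistics by level


mframe.groupby(level='colors', sort=False).sum()  # replaces mframe.sum(level='colors'), removed in pandas 2.0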
 
 
 
| object | pen | pen | paper | paper |
|---|---|---|---|---|
| id | 1 | 2 | 1 | 2 |
| colors |  |  |  |  |
| white | 4 | 6 | 8 | 10 |
| red | 20 | 22 | 24 | 26 |
 
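To aggregate over a level of the columns instead, transpose the frame, group by that level, and transpose back (older pandas versions accepted mframe.sum(level='id', axis=1) for this).

mframe.T.groupby(level='id').sum().T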
 
 
 
|  | id | 1 | 2 |
|---|---|---|---|
| colors | status |  |  |
| white | up | 2 | 4 |
|  | down | 10 | 12 |
| red | up | 18 | 20 |
|  | down | 26 | 28 |
 
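🌎7. Importing data

Very often the data we want to analyse lives in files on disk. This article shows how to import the most common format, CSV; later posts will cover importing JSON data, connecting to SQL databases, and other ways of loading data.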
 
import pandas as pd 
df = pd.read_csv('student.csv')
df
 
Student 	ID 	name   age  gender
11        1111    Dw    3  Female
12        1112     Q   23    Male
13        1113     W   21  Female
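
Since student.csv itself is not included here, the following sketch reads equivalent CSV text from memory so it can be run as is. The file contents (csv_text) are an assumption modelled on the rows printed above, and index_col is just one of several options read_csv offers.

```python
import io
import pandas as pd

# Hypothetical contents standing in for student.csv.
csv_text = """ID,name,age,gender
1111,Dw,3,Female
1112,Q,23,Male
1113,W,21,Female
"""

# read_csv accepts a path or any file-like object; index_col picks the row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col='ID')
print(df)
print(df.dtypes)
```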
 
 
🌏8. Data processing


🌐8.1 Merging

The merge() function works like a multi-table join in SQL.

🎇8.1.1 Inner join
 
import numpy as np
import pandas as pd
 
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'price':[12.33,11.44,33.21,13.23,33.62]})
frame1
 
 
 
|  | id | price | 
|---|
| 0 | ball | 12.33 | 
|---|
| 1 | pencil | 11.44 | 
|---|
| 2 | pen | 33.21 | 
|---|
| 3 | mug | 13.23 | 
|---|
| 4 | ashtray | 33.62 | 
|---|
 
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
                      'color':['white','red','red','black']})
frame2
 
 
 
|  | id | color | 
|---|
| 0 | pencil | white | 
|---|
| 1 | pencil | red | 
|---|
| 2 | ball | red | 
|---|
| 3 | pen | black | 
|---|
 
pd.merge(frame1,frame2)
 
 
 
|  | id | price | color | 
|---|
| 0 | ball | 12.33 | red | 
|---|
| 1 | pencil | 11.44 | white | 
|---|
| 2 | pencil | 11.44 | red | 
|---|
| 3 | pen | 33.21 | black | 
|---|
 
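The DataFrame returned above consists of the rows of the two original DataFrames that share the same id. No column was specified for the merge, so pandas used the column name the two frames have in common. In practice the join condition is usually stated explicitly with the on option.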
 
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
                      'color':['white','red','red','black','green'],
                      'brand':['OMG','ABC','ABC','POD','POD']})
frame1
 
 
 
|  | id | color | brand | 
|---|
| 0 | ball | white | OMG | 
|---|
| 1 | pencil | red | ABC | 
|---|
| 2 | pen | red | ABC | 
|---|
| 3 | mug | black | POD | 
|---|
| 4 | ashtray | green | POD | 
|---|
 
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
                      'brand':['OMG','POD','ABC','POD']})
frame2
 
 
 
|  | id | brand | 
|---|
| 0 | pencil | OMG | 
|---|
| 1 | pencil | POD | 
|---|
| 2 | ball | ABC | 
|---|
| 3 | pen | POD | 
|---|
 
pd.merge(frame1,frame2,on='id') 
 
 
 
|  | id | color | brand_x | brand_y | 
|---|
| 0 | ball | white | OMG | ABC | 
|---|
| 1 | pencil | red | ABC | OMG | 
|---|
| 2 | pencil | red | ABC | POD | 
|---|
| 3 | pen | red | ABC | POD | 
|---|
 
pd.merge(frame1,frame2,on='brand') 
 
 
 
|  | id_x | color | brand | id_y | 
|---|
| 0 | ball | white | OMG | pencil | 
|---|
| 1 | pencil | red | ABC | ball | 
|---|
| 2 | pen | red | ABC | ball | 
|---|
| 3 | mug | black | POD | pencil | 
|---|
| 4 | mug | black | POD | pen | 
|---|
| 5 | ashtray | green | POD | pencil | 
|---|
| 6 | ashtray | green | POD | pen | 
|---|
 
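When the key columns have different names in the two frames, use left_on and right_on. In the two tables below one key is called id and the other sid, so we join the first table's id with the second table's sid.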
 
frame2.columns = ['sid','brand']
frame2
 
 
 
|  | sid | brand | 
|---|
| 0 | pencil | OMG | 
|---|
| 1 | pencil | POD | 
|---|
| 2 | ball | ABC | 
|---|
| 3 | pen | POD | 
|---|
 
pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')
 
 
 
|  | id | color | brand_x | sid | brand_y | 
|---|
| 0 | ball | white | OMG | ball | ABC | 
|---|
| 1 | pencil | red | ABC | pencil | OMG | 
|---|
| 2 | pencil | red | ABC | pencil | POD | 
|---|
| 3 | pen | red | ABC | pen | POD | 
|---|
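
The result above keeps both key columns and labels the clashing brand columns with the default suffixes _x and _y. As a sketch of one possible clean-up, the suffixes option can give those columns clearer names and the redundant sid column can be dropped afterwards; the suffix strings used here are arbitrary choices.

```python
import pandas as pd

frame1 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'color': ['white', 'red', 'red', 'black', 'green'],
                       'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
frame2 = pd.DataFrame({'sid': ['pencil', 'pencil', 'ball', 'pen'],
                       'brand': ['OMG', 'POD', 'ABC', 'POD']})

merged = pd.merge(frame1, frame2, left_on='id', right_on='sid',
                  suffixes=('_left', '_right'))  # rename the overlapping 'brand' columns
merged = merged.drop(columns='sid')              # 'sid' duplicates 'id' after the join
print(merged)
```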
 
merge() performs an inner join by default: the keys in the result above are the intersection of the keys in the two frames.
 
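🎉8.1.2 Outer join

- The join type is set with the how option
- Left join: the shared keys plus the keys found only in the left frame
- Right join: the shared keys plus the keys found only in the right frame
- Outer join: the union of all keys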
frame2.columns=['id','brand']
 
pd.merge(frame1,frame2,how='outer')
 
 
 
|  | id | color | brand | 
|---|
| 0 | ball | white | OMG | 
|---|
| 1 | pencil | red | ABC | 
|---|
| 2 | pen | red | ABC | 
|---|
| 3 | mug | black | POD | 
|---|
| 4 | ashtray | green | POD | 
|---|
| 5 | pencil | NaN | OMG | 
|---|
| 6 | pencil | NaN | POD | 
|---|
| 7 | ball | NaN | ABC | 
|---|
| 8 | pen | NaN | POD | 
|---|
 
pd.merge(frame1,frame2,how='left')
 
 
 
|  | id | color | brand | 
|---|
| 0 | ball | white | OMG | 
|---|
| 1 | pencil | red | ABC | 
|---|
| 2 | pen | red | ABC | 
|---|
| 3 | mug | black | POD | 
|---|
| 4 | ashtray | green | POD | 
|---|
 
pd.merge(frame1,frame2,how='right')
 
 
 
|  | id | color | brand | 
|---|
| 0 | pencil | NaN | OMG | 
|---|
| 1 | pencil | NaN | POD | 
|---|
| 2 | ball | NaN | ABC | 
|---|
| 3 | pen | NaN | POD | 
|---|
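
To see which side each row of a join came from, merge() accepts indicator=True, which adds a _merge column. The sketch below uses two small made-up frames (left and right) joined on a single key so the effect of each how value is easy to follow.

```python
import pandas as pd

left = pd.DataFrame({'id': ['ball', 'pencil', 'mug'],
                     'price': [12.33, 11.44, 13.23]})
right = pd.DataFrame({'id': ['pencil', 'ball', 'pen'],
                      'color': ['red', 'white', 'black']})

# indicator=True records whether a key was in the left frame, the right frame, or both.
outer = pd.merge(left, right, on='id', how='outer', indicator=True)
inner = pd.merge(left, right, on='id')             # shared keys only
left_join = pd.merge(left, right, on='id', how='left')

print(outer)
```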
 
To merge on several keys, pass a list of them to the on option.
 
pd.merge(frame1,frame2,on=['id','brand'],how='outer')
 
 
 
|  | id | color | brand | 
|---|
| 0 | ball | white | OMG | 
|---|
| 1 | pencil | red | ABC | 
|---|
| 2 | pen | red | ABC | 
|---|
| 3 | mug | black | POD | 
|---|
| 4 | ashtray | green | POD | 
|---|
| 5 | pencil | NaN | OMG | 
|---|
| 6 | pencil | NaN | POD | 
|---|
| 7 | ball | NaN | ABC | 
|---|
| 8 | pen | NaN | POD | 
|---|
 
🎊8.1.3 Joining on the index
 
pd.merge(frame1,frame2,left_index=True,right_index=True)
 
 
 
|  | id_x | color | brand_x | id_y | brand_y | 
|---|
| 0 | ball | white | OMG | pencil | OMG | 
|---|
| 1 | pencil | red | ABC | pencil | POD | 
|---|
| 2 | pen | red | ABC | ball | ABC | 
|---|
| 3 | mug | black | POD | pen | POD | 
|---|
 
frame2.columns = ['id2','brand2']
frame1.join(frame2)
 
 
 
|  | id | color | brand | id2 | brand2 | 
|---|
| 0 | ball | white | OMG | pencil | OMG | 
|---|
| 1 | pencil | red | ABC | pencil | POD | 
|---|
| 2 | pen | red | ABC | ball | ABC | 
|---|
| 3 | mug | black | POD | pen | POD | 
|---|
| 4 | ashtray | green | POD | NaN | NaN | 
|---|
 
🎄8.2 Concatenation

- NumPy's concatenate() function concatenates arrays
- pandas' concat() function concatenates objects along an axis
arr1 = np.arange(9).reshape(3,3)
arr1
 
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
 
arr2 = np.arange(6,15).reshape(3,3)
arr2
 
array([[ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
 
np.concatenate([arr1,arr2],axis=1)
 
array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11],
       [ 6,  7,  8, 12, 13, 14]])
 
ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])
ser1
 
1    0.180191
2    0.061649
3    0.236378
4    0.105309
dtype: float64
 
ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])
ser2
 
5    0.935277
6    0.516146
7    0.210461
8    0.912048
dtype: float64
 
pd.concat([ser1,ser2])
 
1    0.180191
2    0.061649
3    0.236378
4    0.105309
5    0.935277
6    0.516146
7    0.210461
8    0.912048
dtype: float64
 
pd.concat([ser1,ser2],axis = 1)
 
 
 
|  | 0 | 1 | 
|---|
| 1 | 0.180191 | NaN | 
|---|
| 2 | 0.061649 | NaN | 
|---|
| 3 | 0.236378 | NaN | 
|---|
| 4 | 0.105309 | NaN | 
|---|
| 5 | NaN | 0.935277 | 
|---|
| 6 | NaN | 0.516146 | 
|---|
| 7 | NaN | 0.210461 | 
|---|
| 8 | NaN | 0.912048 | 
|---|
 
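The default is an outer join. With join='inner' only the index labels shared by both objects are kept; ser1 and ser2 have none in common, so the result of the call below is an empty DataFrame.

pd.concat([ser1,ser2],axis=1,join='inner')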
 
 
 
 
To create a hierarchical index, use the keys option.
 
pd.concat([ser1,ser2],keys=[1,2])
 
1  1    0.180191
   2    0.061649
   3    0.236378
   4    0.105309
2  5    0.935277
   6    0.516146
   7    0.210461
   8    0.912048
dtype: float64
 
pd.concat([ser1,ser2],axis=1,keys=[1,2])
 
 
 
|  | 1 | 2 | 
|---|
| 1 | 0.180191 | NaN | 
|---|
| 2 | 0.061649 | NaN | 
|---|
| 3 | 0.236378 | NaN | 
|---|
| 4 | 0.105309 | NaN | 
|---|
| 5 | NaN | 0.935277 | 
|---|
| 6 | NaN | 0.516146 | 
|---|
| 7 | NaN | 0.210461 | 
|---|
| 8 | NaN | 0.912048 | 
|---|
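
The join option is easier to observe when the two indexes partly overlap. The following sketch uses fixed values and overlapping labels (both made up for the example) and also shows ignore_index, which discards the original labels when stacking vertically.

```python
import pandas as pd

ser1 = pd.Series([0.1, 0.2, 0.3], index=[1, 2, 3])
ser2 = pd.Series([0.4, 0.5, 0.6], index=[2, 3, 4])

outer = pd.concat([ser1, ser2], axis=1)                # union of labels, NaN where missing
inner = pd.concat([ser1, ser2], axis=1, join='inner')  # only labels 2 and 3 survive
stacked = pd.concat([ser1, ser2], ignore_index=True)   # fresh 0..5 index

print(outer, inner, stacked, sep='\n\n')
```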
 
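🎋8.3 Combining

- Use a combining function when the data cannot be put together by merging or concatenating
- combine_first() combines Series objects while aligning their data: values from the first Series win, and the second only fills labels that are missing or NaN in the first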
ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])
ser1
 
1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
dtype: float64
 
ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])
ser2
 
2    0.017397
4    0.764295
5    0.407552
6    0.352605
dtype: float64
 
ser1.combine_first(ser2)
 
1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
6    0.352605
dtype: float64
 
pd.concat([ser1,ser2])
 
1    0.708279
2    0.233048
3    0.030991
4    0.261291
5    0.379752
2    0.017397
4    0.764295
5    0.407552
6    0.352605
dtype: float64
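
The outputs above show that combine_first() keeps one value per label while concat() keeps duplicates. A small sketch with fixed, made-up values makes the precedence rule visible, including the case where the first Series holds a NaN.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1.0, np.nan, 3.0], index=[1, 2, 3])
ser2 = pd.Series([20.0, 30.0, 40.0], index=[2, 3, 4])

# Values from ser1 win; ser2 fills label 2 (NaN in ser1) and label 4 (absent from ser1).
combined = ser1.combine_first(ser2)
print(combined)
```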
 