小渣渣 study notes: Python day 60 [pandas 3: indexing, iteration, binary operator functions, function application]

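I feel my approach is wrong. DataFrame has more than 100 attributes and methods; typing out every example does not make them stick, and only a handful are used regularly. It would be better to skim the overview first, look at one example per method, move on if it does not make sense, and look up DataFrame methods when a real task calls for them. Even after typing out everything below, I will not remember it the next day. Things are getting muddled. numpy, pandas and matplotlib are all modules commonly used for data analysis and suited to big-data work. For the 8 days left before the National Day holiday I plan to study some JS, get a feel for AJAX, review HTML, and then focus on the Django framework.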

numpy: https://numpy.org/

pandas: https://pandas.pydata.org/

matplotlib: https://matplotlib.org/

pandas DataFrame

indexing,iteration

#isin(values)  eq()
import numpy as np
import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},index=['falcon', 'dog'])
print(df)
print('----------------')
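print(df.isin([0,2])) # check whether each value (leg and wing counts for the falcon and the dog) is in the list [0,2]: True if it is, False otherwise
print('--------------')
print(df.eq(4))  # compare every element of df with the argument passed to eq(): True where equal, False otherwise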

        num_legs  num_wings
falcon         2          2
dog            4          0
----------------
        num_legs  num_wings
falcon      True       True
dog        False       True
--------------
        num_legs  num_wings
falcon     False      False
dog         True      False
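
As a small extension of the isin() example above, here is a sketch of two variations: passing a dict so that each column is checked against its own list, and summing the boolean result to count matches per row. The variable names `per_column` and `matches_per_row` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
                  index=['falcon', 'dog'])

# a dict checks each named column against its own list of allowed values;
# columns that are not in the dict come back as all False
per_column = df.isin({'num_wings': [0, 3]})
print(per_column)

# True counts as 1, so summing across a row counts how many values matched
matches_per_row = df.isin([0, 2]).sum(axis=1)
print(matches_per_row)
```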

#where(condition,other,....)
df=pd.DataFrame(np.arange(12).reshape(3,4),columns=['A','B','C','D'])
print(df)
print('--------------')
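print(-df) # negates every element
print('---------------')
df = df.where(df%3==0,-df) # element-wise: values divisible by 3 are kept, all others are replaced by their negatives
print(df)
print('-------------')
print(df.where(df%3==0,-df)==np.where(df%3==0,df,-df)) # np.where returns an ndarray, while df.where returns the same type as the caller (a DataFrame); they can still be compared element-wise when the shapes match
print('--------------')
print(df.where(df%3==0,-df)==df.mask(~(df%3==0),-df))  # ~ is element-wise logical NOT on the boolean frame; mask replaces values where the condition is True, so mask(~cond) gives the same result as where(cond)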

   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
--------------
   A  B   C   D
0  0 -1  -2  -3
1 -4 -5  -6  -7
2 -8 -9 -10 -11
---------------
   A  B   C   D
0  0 -1  -2   3
1 -4 -5   6  -7
2 -8  9 -10 -11
-------------
      A     B     C     D
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
--------------
      A     B     C     D
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
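
The two comparisons above can be checked more directly. The following is a minimal sketch on the original, unmodified frame; `cond`, `via_where`, `via_mask` and `via_numpy` are names chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
cond = df % 3 == 0                    # boolean DataFrame, True where divisible by 3

via_where = df.where(cond, -df)       # keep where cond is True, otherwise take -df
via_mask = df.mask(~cond, -df)        # replace where ~cond is True, i.e. where cond is False
via_numpy = np.where(cond, df, -df)   # plain ndarray, row and column labels are lost

print(via_where.equals(via_mask))                   # labels, dtypes and values all match
print(type(via_numpy).__name__)                     # ndarray
print((via_where.to_numpy() == via_numpy).all())    # same values position by position
```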

#query()
df = pd.DataFrame({'A': range(1, 6),
                   'B': range(10, 0, -2),
                   'C': range(10, 5, -1)})
print(df)
print('-----------------')
print(df.query('A>B')) # only the row with index 4 has A greater than B; equivalent to df[df.A>df.B], and the result is a DataFrame

   A   B   C
0  1  10  10
1  2   8   9
2  3   6   8
3  4   4   7
4  5   2   6
-----------------
   A  B  C
4  5  2  6
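
A short sketch comparing query() with plain boolean indexing, and showing how a local Python variable can be referenced inside the query string with `@`. The variable `threshold` is a hypothetical name added for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 6),
                   'B': range(10, 0, -2),
                   'C': range(10, 5, -1)})

by_query = df.query('A > B')
by_mask = df[df.A > df.B]             # boolean indexing with a Series mask
print(by_query.equals(by_mask))

threshold = 8                         # hypothetical local variable
big_c = df.query('C >= @threshold')   # @name looks up a Python variable
print(big_c)
```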

Binary operator functions

add() sub() mul() div() mod() pow()

df = pd.DataFrame({'angles':[0,3,4],'degrees':[360,180,360]},index=['circle','triangle','rectangle'])
print(df)
print('------------')
print(df.add(1))
print('-------------')
print(df+1) # same result as df.add(1)
print('---------------')
print(df-[1,2]) # subtracts 1 from the first column and 2 from the second; equivalent to df.sub([1,2],axis='columns')

           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
------------
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
-------------
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
---------------
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
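
The method forms (sub(), add() and so on) differ from the operators mainly in that they accept an `axis` argument. A minimal sketch, assuming the same frame as above, of matching a list against the columns versus against the rows:

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])

# one value per column: 1 goes to 'angles', 2 goes to 'degrees'
by_columns = df.sub([1, 2], axis='columns')

# one value per row: 1 for circle, 2 for triangle, 3 for rectangle
by_rows = df.sub([1, 2, 3], axis='index')

print(by_columns)
print(by_rows)
```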

eq() ne() lt() le() gt() ge()

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])
print(df)
print('--------------')
print(df==100) # same as df.eq(100)
print('--------------')
df != pd.Series([100, 250], index=["cost", "revenue"])

   cost  revenue
A   250      100
B   150      250
C   100      300
--------------
    cost  revenue
A  False     True
B  False    False
C   True    False
--------------

	cost	revenue
A	True	True
B	True	False
C	False	True

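df.eq([250, 250, 100], axis='index') # axis='index' matches the list against the rows, so each column is compared with it from top to bottom; the list length must equal the number of rows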

	cost	revenue
A	True	False
B	False	True
C	True	False

other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
                     index=['A', 'B', 'C', 'D'])
other

	revenue
A	300
B	250
C	100
D	150

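df.gt(other) # df has 3 rows and other has 4; the two frames are aligned on the union of their row and column labels, which is where the extra row D comes from. other has no cost column, so cost is compared against NaN and comes out False everywhere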

	cost	revenue
A	False	False
B	False	False
C	False	True
D	False	False

df.at['A','cost']=400
df.gt(other)  # setting the first element to 400 still does not give True in that position: after alignment, cost in other is all NaN, and any comparison with NaN is False

	cost	revenue
A	False	False
B	False	False
C	False	True
D	False	False
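
To see what gt() is actually comparing, the alignment step can be made visible with align(). This is a sketch on the original values (before cost was set to 400); `left`, `right` and `shared` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'cost': [250, 150, 100],
                   'revenue': [100, 250, 300]},
                  index=['A', 'B', 'C'])
other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
                     index=['A', 'B', 'C', 'D'])

# both frames reindexed to the union of row and column labels,
# with NaN wherever one side has no data
left, right = df.align(other, join='outer')
print(left)    # row D is all NaN
print(right)   # column cost is all NaN

result = df.gt(other)

# restricting both sides to the labels they share avoids the NaN comparisons
shared = df[['revenue']].gt(other.loc[df.index])
print(shared)
```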

# comparing against a DataFrame with a MultiIndex on the rows
df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
                             'revenue': [100, 250, 300, 200, 175, 225]},
                            index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
                                   ['A', 'B', 'C', 'A', 'B', 'C']])
df_multindex

		cost	revenue
Q1	A	250	100
	B	150	250
	C	100	300
Q2	A	150	200
	B	300	175
	C	220	225

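df_multindex.to_excel('D:/aaa.xls') # export to Excel; reading it back with the defaults does not round-trip: the two index levels come back as unnamed columns, with NaN where the Q1/Q2 cells were merged (recent pandas versions may no longer write .xls, in which case .xlsx is the safer extension)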
df_multindex1 = pd.read_excel('D:/aaa.xls')
df_multindex1

	Unnamed: 0	Unnamed: 1	cost	revenue
0	Q1	A	250	100
1	NaN	B	150	250
2	NaN	C	100	300
3	Q2	A	150	200
4	NaN	B	300	175
5	NaN	C	220	225
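
The index can be restored on reading by telling the reader which columns form it. read_excel accepts an `index_col` list for this; the sketch below shows the same idea with CSV and an in-memory buffer instead, so that it runs without an Excel engine installed. That substitution is an assumption made for the example, not what the notes above did.

```python
import io
import pandas as pd

df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
                             'revenue': [100, 250, 300, 200, 175, 225]},
                            index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
                                   ['A', 'B', 'C', 'A', 'B', 'C']])

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df_multindex.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1] rebuilds a two-level row index from the first two columns
restored = pd.read_csv(buffer, index_col=[0, 1])
print(restored)
print(restored.index.nlevels)
```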

df

	cost	revenue
A	400.0	100.0
B	150.0	250.0
C	100.0	300.0

df.le(df_multindex,level=1)

		cost	revenue
Q1	A	False	True
	B	True	True
	C	True	True
Q2	A	False	True
	B	True	False
	C	True	False

combine()

df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1 # in a notebook a bare expression is displayed directly, like in an interactive shell

	A	B
0	0	4
1	0	4

df2

	A	B
0	1	3
1	1	3

condition = lambda c1,c2:c1 if c1.sum()<c2.sum() else c2  # column by column, keep whichever column has the smaller sum (this picks whole columns, not individual values)
df1.combine(df2,condition)

	A	B
0	0	3
1	0	3

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
print(df1)
print(df2)

df1.combine(df2,np.minimum) # element-wise minimum at each position; unlike the condition lambda above, which chooses whole columns by their sums

	A	B
0	1	2
1	0	3
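
The two combine() functions used above give different results on the same data, which is easiest to see side by side. A minimal sketch; `smaller_sum`, `by_column` and `by_element` are names chosen here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})

# whole-column choice: keep whichever column has the smaller sum
smaller_sum = lambda c1, c2: c1 if c1.sum() < c2.sum() else c2
by_column = df1.combine(df2, smaller_sum)

# element-wise choice: the smaller value at each position
by_element = df1.combine(df2, np.minimum)

print(by_column)    # A and B both come from df2 (sums 5 vs 2, and 6 vs 6)
print(by_element)   # mixes values from both frames
```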

#combine_first(): handy
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2) # fills the missing (null) values in df1 with the values at the same position in df2

	A	B
0	1.0	3.0
1	0.0	4.0

#merge(): handy, similar to join()
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

df1

	lkey	value
0	foo	1
1	bar	2
2	baz	3
3	foo	5

df2

	rkey	value
0	foo	5
1	bar	6
2	baz	7
3	foo	8

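df1.merge(df2,left_on='lkey',right_on='rkey',suffixes=('_left','_right')) # same result as SQL: SELECT * FROM t1, t2 WHERE t1.lkey = t2.rkey (an inner join); a key that repeats on both sides, such as 'foo', is paired in every combination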

	lkey	value_left	rkey	value_right
0	foo	1	foo	5
1	foo	1	foo	8
2	foo	5	foo	5
3	foo	5	foo	8
4	bar	2	bar	6
5	baz	3	baz	7
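
merge() defaults to an inner join. The following sketch shows the `how` and `indicator` parameters on two small frames made up for this example (`left`, `right` and their columns are not from the notes above).

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'rval': [5, 6, 7]})

inner = left.merge(right, on='key')                 # only keys found on both sides
left_join = left.merge(right, on='key', how='left') # every row of left, NaN where unmatched

# indicator=True adds a _merge column saying where each row came from
outer = left.merge(right, on='key', how='outer', indicator=True)

print(inner)
print(left_join)
print(outer)
```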

#join()
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df

	key	A
0	K0	A0
1	K1	A1
2	K2	A2
3	K3	A3
4	K4	A4
5	K5	A5

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
other

	key	B
0	K0	B0
1	K1	B1
2	K2	B2

df.join(other,lsuffix='_left',rsuffix='_right')

	key_left	A	key_right	B
0	K0	A0	K0	B0
1	K1	A1	K1	B1
2	K2	A2	K2	B2
3	K3	A3	NaN	NaN
4	K4	A4	NaN	NaN
5	K5	A5	NaN	NaN
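
Note that the join() call above matches on the row index (0 to 5), not on the key column; the two key columns only line up because the rows happen to be in the same order. A sketch of two ways to join on the key values themselves:

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

# move 'key' into the index on both sides, then join index to index
by_index = df.set_index('key').join(other.set_index('key'))

# or keep df's integer index and match its 'key' column against other's index
by_column = df.join(other.set_index('key'), on='key')

print(by_index)
print(by_column)
```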

#append()
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df

	A	B
0	1	2
1	3	4

df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
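df.append(df2) # the row index is not renumbered this way (0, 1, 0, 1); the variant below fixes that. Note that DataFrame.append was removed in pandas 2.0, where pd.concat([df, df2]) is the replacement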

	A	B
0	1	2
1	3	4
0	5	6
1	7	8

df.append(df2,ignore_index=True)

	A	B
0	1	2
1	3	4
2	5	6
3	7	8

# two ways to add rows dynamically; method 1 (inefficient): append inside a loop
df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A':i},ignore_index=True)
df

	A
0	0
1	1
2	2
3	3
4	4

# method 2: build all the pieces first, then concatenate once
pd.concat([pd.DataFrame([i],columns=['A']) for i in range(5)],ignore_index=True)

	A
0	0
1	1
2	2
3	3
4	4
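
For pandas versions where append() is no longer available, the same results can be obtained with pd.concat. The last part is a sketch of a third option: collect rows in a plain Python list and build the frame once (`rows` and `built` are names chosen here).

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

stacked = pd.concat([df, df2])                          # index stays 0, 1, 0, 1
renumbered = pd.concat([df, df2], ignore_index=True)    # index becomes 0, 1, 2, 3

# collect rows as dicts, then create the DataFrame in one step
rows = [{'A': i} for i in range(5)]
built = pd.DataFrame(rows)

print(stacked)
print(renumbered)
print(built)
```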

#update(): applies the differences from a source table to a target table, modifying the target in place
df = pd.DataFrame({'A':[1,2,3],'B':[400,500,600]})
df

	A	B
0	1	400
1	2	500
2	3	600

new_df=pd.DataFrame({'B':[4,5,6],'C':[7,8,9]})
new_df

	B	C
0	4	7
1	5	8
2	6	9

df.update(new_df) # only columns present in both frames are updated, from new_df (the source) into df (the target); column C is ignored
df

	A	B
0	1	4
1	2	5
2	3	6
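
Two details of update() that are easy to miss: it returns None because it changes the target in place, and NaN values in the source do not overwrite existing values. A minimal sketch, using float values in column B (an assumption made here to keep the dtype unchanged):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [400.0, 500.0, 600.0]})
new_df = pd.DataFrame({'B': [4, np.nan, 6]})

result = df.update(new_df)   # modifies df in place and returns None

print(result)
print(df)                    # the middle value of B is still 500.0
```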

#compare(): handy, shows the differences between two tables (also: at/iat and loc/iloc can be used to change individual values)
df1 = pd.read_excel('D:/aaa.xls')
df2 = pd.read_excel('D:/bbb.xls')
df1.compare(df2)

    cost
    self  other
0  400.0  250.0
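
Since the compare() example above depends on two Excel files, here is a self-contained sketch with two small frames built in memory. The values are made up to mirror the output above; `diff` and `full` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'cost': [400, 150, 100], 'revenue': [100, 250, 300]})
df2 = pd.DataFrame({'cost': [250, 150, 100], 'revenue': [100, 250, 300]})

# by default only the rows and columns that differ are shown
diff = df1.compare(df2)
print(diff)

# keep_shape=True keeps every row and column, with NaN where values are equal
full = df1.compare(df2, keep_shape=True)
print(full)
```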

#assign(): handy
df = pd.DataFrame({'tom':[58,100,22],'jerry':[68,77,82]},index=['course1','course2','course3'])
df

	tom	jerry
course1	58	68
course2	100	77
course3	22	82

df.assign(avg = (df['tom']+df['jerry'])/2) # builds a new column from a calculation on existing columns; returns a new DataFrame and leaves df unchanged

	tom	jerry	avg
course1	58	68	63.0
course2	100	77	88.5
course3	22	82	52.0
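
assign() also accepts callables, which receive the frame being built. A sketch, with a hypothetical extra column `above_60`, showing that a later column can refer to one defined earlier in the same call:

```python
import pandas as pd

df = pd.DataFrame({'tom': [58, 100, 22], 'jerry': [68, 77, 82]},
                  index=['course1', 'course2', 'course3'])

result = df.assign(
    avg=lambda d: (d['tom'] + d['jerry']) / 2,
    above_60=lambda d: d['avg'] > 60,   # uses the 'avg' column defined just above
)

print(result)
print(df.columns.tolist())   # the original frame has no new columns
```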

Function application, GroupBy, window

#apply(function, axis=): applies a function along one axis, i.e. down each column (axis=0) or across each row (axis=1)
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

	A	B
0	4	9
1	4	9
2	4	9

df.apply(np.sqrt) # the function comes from numpy and takes the square root; same result as np.sqrt(df)

	A	B
0	2.0	3.0
1	2.0	3.0
2	2.0	3.0

df.apply(np.sum,axis=0) # sums down each column: A totals 12 and B totals 27

A    12
B    27
dtype: int64
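
The examples above only use axis=0. Here is a sketch of the row-wise direction with axis=1, where the function receives one row at a time as a Series; the lambda is just an illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])

col_sums = df.apply(np.sum, axis=0)                           # one result per column
row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)  # one result per row

print(col_sums)
print(row_sums)
print(df.sum(axis=1))   # built-in equivalent of the row-wise lambda
```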

#groupby()
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

	Animal	Max Speed
0	Falcon	380.0
1	Falcon	370.0
2	Parrot	24.0
3	Parrot	26.0

df.groupby(['Animal']).mean() # group first, then take the mean of each group

	Max Speed
Animal
Falcon	375.0
Parrot	25.0

arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
df

		Max Speed
Animal	Type
Falcon	Captive	390.0
Falcon	Wild	350.0
Parrot	Captive	30.0
Parrot	Wild	20.0

df.groupby(level=0).mean() # group by index level 0 (Animal)

	Max Speed
Animal
Falcon	370.0
Parrot	25.0

df.groupby(level='Type').mean() # group by Type (captive vs. wild)

	Max Speed
Type
Captive	210.0
Wild	185.0

l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df

	a	b	c
0	1	2.0	3
1	1	NaN	4
2	2	1.0	3
3	1	2.0	2

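df.groupby(by=["b"]).sum()   # did not understand this at first: groupby drops rows whose key is NaN by default, so row 1 is left out; rows 0 and 3 share b=2.0 and are summed, and row 2 forms the b=1.0 group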

	a	c
b
1.0	2	3
2.0	2	5
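
The missing row in the result above comes from the `dropna` parameter of groupby(), which defaults to True. A sketch comparing the default with dropna=False (available in pandas 1.1 and later):

```python
import pandas as pd

l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])

default = df.groupby(by=["b"]).sum()                  # the row with b=NaN is dropped
with_nan = df.groupby(by=["b"], dropna=False).sum()   # NaN is kept as its own group

print(default)
print(with_nan)
```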

posted on 2020-09-21 18:12 by 94小渣渣
