Merge是将不同的表按key关联到一个表:https://zhuanlan.zhihu.com/p/392717632
pd.merge(left, # 待合并的2个数据框 right, how='inner', # ‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’ on=None, # 连接的键,默认是相同的键 left_on=None, # 指定不同的连接字段:键不同,但是键的取值有相同的内容 right_on=None, left_index=False, # 根据索引来连接 right_index=False, sort=False, # 是否排序 suffixes=('_x', '_y'), # 改变后缀 copy=True, indicator=False, # 显示字段来源 validate=None)
参数解释:
left、right:待合并的数据帧
how:合并的方式,有5种:{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, 默认是 ‘inner’
1、 left:左连接,保留left的全部数据;right类似;类比于SQL的left join 或者right join
2、outer:全连接功能,类似SQL的full outer join
3、inner:交叉连接,类比于SQL的inner join
4、cross:创建两个数据帧DataFrame的笛卡尔积,默认保留左边的顺序
on:连接的列属性;默认是两个DataFrame的相同字段
left_on/right_on:指定两个不同的键进行联结
left_index、right_index:通过索引进行合并
suffixes:指定我们自己想要的后缀
indictor:显示字段的来源
一、实列
先生成4个表
import pandas as pd df1=pd.DataFrame({'userid':list('abcd'),'age':[10,11,12,13]}) df2=pd.DataFrame({'userid':list('aca'),'score':[100,200,300]}) df3=pd.DataFrame({'userid':list('ae'),'score':[400,600]}) df4=pd.DataFrame({'userid1':list('ac'),'score':[800,900]})
| df1 | userid | age |
|---|---|---|
| 0 | a | 10 |
| 1 | b | 11 |
| 2 | c | 12 |
| 3 | d | 13 |
| df2 | userid | score |
|---|---|---|
| 0 | a | 100 |
| 1 | c | 200 |
| 2 | a | 300 |
| df3 | userid | score |
|---|---|---|
| 0 | a | 400 |
| 1 | e | 600 |
| df4 | userid1 | score |
|---|---|---|
| 0 | a | 800 |
| 1 | c | 900 |
1、默认合并:
pd.merge(df1,df2):
- 两个数据框df1(left)、df2(right)有相同的字段userid
- 默认是通过相同的字段(键)进行关联,取出键中相同的值(ac),而且每个键的记录要全部显示,比如a有多条记录
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 100 |
| 1 | a | 10 | 300 |
| 2 | c | 12 | 200 |
2、inner:(默认是内联)
pd.merge(df1,df2,how='inner')
inner称之为内连接。它会直接根据相同的列属性userid进行关联,取出属性下面相同的数据信息
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 100 |
| 1 | a | 10 | 300 |
| 2 | c | 12 | 200 |
3、outer:外联
pd.merge(df1,df2,how='outer')
outer称之为外连接,在拼接的过程中会取两个数据框中键的并集进行拼接
- 外连接,取出全部交集键的并集。例子中是user的并集
- 如果某个键在某个数据框中不存在数据,则为NaN
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 100.0 |
| 1 | a | 10 | 300.0 |
| 2 | b | 11 | NaN |
| 3 | c | 12 | 200.0 |
| 4 | d | 13 | NaN |
再看一个外联
pd.merge(df1,df3,how='outer')
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10.0 | 400.0 |
| 1 | b | 11.0 | NaN |
| 2 | c | 12.0 | NaN |
| 3 | d | 13.0 | NaN |
| 4 | e | NaN | 600.0 |
4、left:以左边数据框中的键为基准;如果左边存在但是右边不存在,则右边用NaN表示
pd.merge(df1,df3,how='left')
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 400.0 |
| 1 | b | 11 | NaN |
| 2 | c | 12 | NaN |
| 3 | d | 13 | NaN |
再看一个left
pd.merge(df1,df2,how='left')
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 100.0 |
| 1 | a | 10 | 300.0 |
| 2 | b | 11 | NaN |
| 3 | c | 12 | 200.0 |
| 4 | d | 13 | NaN |
5、right:以右边数据框中的键的取值为基准;如果右边存在但是左边不存在,则左边用NaN表示
pd.merge(df1,df3,how='right')
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10.0 | 400 |
| 1 | e | NaN | 600 |
再看一个right
pd.merge(df1,df2,how='right')
| userid | age | score | |
|---|---|---|---|
| 0 | a | 10 | 100 |
| 1 | a | 10 | 300 |
| 2 | c | 12 | 200 |
6、left_on和right_on:当要合并的表没有相同的的键,但这些键的取值部分有相同,则要指明左边和右面的表的键
df_rating=pd.read_csv('./ratings.csv',sep=',',engine='python',header =0,names='UserId,MoveId,Rating,Timestamp'.split(',')) df_move=pd.read_csv('./movies.csv',sep=',',engine='python',header=0,names='moveId,title,genres'.split(',')) df_rat_move=pd.merge(df_rating,df_move,left_on='MoveId',right_on='moveId',how='inne
| UserId | MoveId | Rating | Timestamp | |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |
| 100835 | 610 | 170875 | 3.0 | 1493846415 |
| moveId | title | genres | |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy |
| 9738 | 193583 | No Game No Life: Zero (2017) | Animation|Comedy|Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action|Animation |
| 9741 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy |
| UserId | MoveId | Rating | Timestamp | moveId | title | genres | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 1 | 5 | 1 | 4.0 | 847434962 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 2 | 7 | 1 | 4.5 | 1106635946 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 3 | 15 | 1 | 2.5 | 1510577970 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 4 | 17 | 1 | 4.5 | 1305696483 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 100831 | 610 | 160341 | 2.5 | 1479545749 | 160341 | Bloodmoon (1997) | Action|Thriller |
| 100832 | 610 | 160527 | 4.5 | 1479544998 | 160527 | Sympathy for the Underdog (1971) | Action|Crime|Drama |
| 100833 | 610 | 160836 | 3.0 | 1493844794 | 160836 | Hazard (2005) | Action|Drama|Thriller |
| 100834 | 610 | 163937 | 3.5 | 1493848789 | 163937 | Blair Witch (2016) | Horror|Thriller |
| 100835 | 610 | 163981 | 3.5 | 1493850155 | 163981 | 31 (2016) | Horror |