Pandas的拼接操作
一:简述
pandas的拼接分为两种:
- 级联:pd.concat, pd.append
- 合并:pd.merge, pd.join
1. 使用pd.concat()级联
pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:
objs
axis=0
keys
join='outer' / 'inner':表示的是级联的方式,outer会将所有的项进行级联(忽略匹配和不匹配),而inner只会将匹配的项级联到一起,不匹配的不级联
ignore_index=False
二:使用pd.concat()级联
2.1:匹配级联
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
df1 = DataFrame(data=np.random.randint(0,100,size=(3,4)),index=['A','B','C'],columns=['a','b','c','d'])
df2 = DataFrame(data=np.random.randint(0,100,size=(3,4)),index=['A','D','C'],columns=['a','b','e','d'])
display(df1,df2)
a b c d
A 97 3 55 77
B 15 60 53 50
C 90 60 83 74
a b e d
A 71 45 13 99
D 94 98 32 89
C 77 75 93 43
pd.concat((df1,df1,df1),axis=1,join='inner')
a b c d a b c d a b c d
A 97 3 55 77 97 3 55 77 97 3 55 77
B 15 60 53 50 15 60 53 50 15 60 53 50
C 90 60 83 74 90 60 83 74 90 60 83 74
2.2:不匹配级联
不匹配指的是级联的维度的索引不一致。
例如纵向级联时列索引不一致,横向级联时行索引不一致
有2种连接方式:
- 外连接:补NaN(默认模式)
- 内连接:只连接匹配的项
pd.concat((df1,df2),axis=0,join='inner')
a b d
A 97 3 77
B 15 60 50
C 90 60 74
A 71 45 99
D 94 98 89
C 77 75 43
三: 使用pd.merge()合并
merge与concat的区别在于,merge需要依据某一共同的列来进行合并
使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。
注意每一列元素的顺序不要求一致
参数:
- how:out取并集 inner取交集
- on:当有多列相同的时候,可以使用on来指定使用那一列进行合并,on的值为一个列表
3.1: 一对一合并
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering'],
})
df1
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
'hire_date':[2004,2008,2012],
})
df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
pd.merge(df1,df2)
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3.2:多对一合并
df3 = DataFrame({
'employee':['Lisa','Jake'],
'group':['Accounting','Engineering'],
'hire_date':[2004,2016]})
df3
employee group hire_date
0 Lisa Accounting 2004
1 Jake Engineering 2016
df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
'supervisor':['Carly','Guido','Steve']
})
df4
group supervisor
0 Accounting Carly
1 Engineering Guido
2 Engineering Steve
pd.merge(df3,df4)
employee group hire_date supervisor
0 Lisa Accounting 2004 Carly
1 Jake Engineering 2016 Guido
2 Jake Engineering 2016 Steve
3.3:多对多合并
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering']})
df1
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
df5 = DataFrame({'group':['Engineering','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']
})
df5
group supervisor
0 Engineering Carly
1 Engineering Guido
2 HR Steve
pd.merge(df1,df5,how='outer')
employee group supervisor
0 Bob Accounting NaN
1 Jake Engineering Carly
2 Jake Engineering Guido
3 Lisa Engineering Carly
4 Lisa Engineering Guido
5 NaN HR Steve
3.4:key的规范化
- 当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名
df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
'group':['Accounting','Finance','Marketing']})
df1
employee group
0 Jack Accounting
1 Summer Finance
2 Steve Marketing
df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
'hire_date':[2003,2009,2012],
'group':['Accounting','sell','ceo']})
df2
employee hire_date group
0 Jack 2003 Accounting
1 Bob 2009 sell
2 Jake 2012 ceo
pd.merge(df1,df2,on='employee')
employee group_x hire_date group_y
0 Jack Accounting 2003 Accounting
- 当两张表没有可进行连接的列时,可使用left_on和right_on手动指定merge中左右两边的哪一列列作为连接的列
df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
'group':['Accounting','Product','Marketing'],
'hire_date':[1998,2017,2018]})
df1
employee group hire_date
0 Bobs Accounting 1998
1 Linda Product 2017
2 Bill Marketing 2018
df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
'hire_dates':[1998,2016,2007]})
df5
name hire_dates
0 Lisa 1998
1 Bobs 2016
2 Bill 2007
pd.merge(df1,df5,left_on='employee',right_on='name')
employee group hire_date name hire_dates
0 Bobs Accounting 1998 Bobs 2016
1 Bill Marketing 2018 Bill 2007
3.5:内合并与外合并:out取并集 inner取交集
- 内合并:只保留两者都有的key(默认模式)
- 外合并 how='outer':补NaN
四: 案例分析:美国各州人口数据分析
- 需求:
- 导入文件,查看原始数据
- 将人口数据和各州简称数据进行合并
- 将合并的数据中重复的abbreviation列进行删除
- 查看存在缺失数据的列
- 找到有哪些state/region使得state的值为NaN,进行去重操作
- 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
- 合并各州面积数据areas
- 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
- 去除含有缺失数据的行
- 找出2010年的全民人口数据
- 计算各州的人口密度
- 排序,并找出人口密度最高的五个州 df.sort_values()
# 1.导入文件,查看原始数据
import numpy as np
from pandas import DataFrame,Series
import pandas as pd
abb = pd.read_csv('./data/state-abbrevs.csv')
pop = pd.read_csv('./data/state-population.csv')
area = pd.read_csv('./data/state-areas.csv')
# 查看的数据
abb.head(1)
state abbreviation
0 Alabama AL
pop.head(1)
state/region ages year population
0 AL under18 2012 1117489.0
# 2 将人口数据和各州简称数据进行合并
abb_pop = pd.merge(abb,pop,left_on='abbreviation',right_on='state/region',how='outer')
abb_pop.head(3)
state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
2 Alabama AL AL under18 2010 1130966.0
# 3 将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
# 4 查看存在缺失数据的列
abb_pop.isnull().any(axis=0)
state True
state/region False
ages False
year False
population True
dtype: bool
# 5 找到有哪些state/region使得state的值为NaN,进行去重操作
# 找到哪些简称 的全称为空 (就是先找到state中的空值 ,通过state在找到state/region)
# 把简称找到以后 进行去重
# 找全称为空,用该数据找到简称,然后去重
abb_pop.head(5)
state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
2 Alabama AL under18 2010 1130966.0
3 Alabama AL total 2010 4785570.0
4 Alabama AL under18 2011 1125763.0
# 5.1.找出state中的空值
abb_pop['state'].isnull()
# 5.2.将布尔值作为元数据的行索引:定位到所有state为空对应的行数据
abb_pop.loc[abb_pop['state'].isnull()]
# 5.3.将空对应的行数据中的简称这一列的数据取出进行去重操作
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()
# array([], dtype=object)
# 6 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
# 6.1.找出USA对应state列中的空值
# 返回的是bool值
abb_pop['state/region'] == 'USA'
# 6.2.取出USA对应的行数据
abb_pop.loc[abb_pop['state/region'] == 'USA']
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
indexs
Int64Index([2496, 2497, 2498, 2499, 2500, 2501, 2502, 2503, 2504, 2505, 2506,
2507, 2508, 2509, 2510, 2511, 2512, 2513, 2514, 2515, 2516, 2517,
2518, 2519, 2520, 2521, 2522, 2523, 2524, 2525, 2526, 2527, 2528,
2529, 2530, 2531, 2532, 2533, 2534, 2535, 2536, 2537, 2538, 2539,
2540, 2541, 2542, 2543],
dtype='int64')
# 6.3.将USA对应的空值覆盖成对应的值
abb_pop.loc[indexs,'state'] = 'United States'
# 6.4 找到PR所对应的行数据
abb_pop['state/region'] == 'PR'
abb_pop.loc[abb_pop['state/region'] == 'PR']
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[indexs,'state'] = 'ppprrr'
area.head()
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
# 7 合并各州面积数据areas
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()
state state/region ages year population area (sq. mi)
0 Alabama AL under18 2012.0 1117489.0 52423.0
1 Alabama AL total 2012.0 4817528.0 52423.0
2 Alabama AL under18 2010.0 1130966.0 52423.0
3 Alabama AL total 2010.0 4785570.0 52423.0
4 Alabama AL under18 2011.0 1125763.0 52423.0
# 8 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
# 9 去除含有缺失数据的行
abb_pop_area['area (sq. mi)'].isnull()
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()]
# 获取行索引
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)
# 10 找出2010年的全民人口数据
# query 做条件查询
df_2010 = abb_pop_area.query('year == 2010 & ages == "total"')
df_2010
# 11 计算各州的人口密度
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head(1)
state state/region ages year population area (sq. mi) midu
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
# 12 排序,并找出人口密度最高的五个州 df.sort_values()
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)

浙公网安备 33010602011771号