I. Data Merging
Data merging mainly involves two kinds of operations:
- Concatenation: pd.concat() joins multiple DataFrame objects along an axis into a new DataFrame
- Merging: pd.merge() joins the rows of different DataFrames based on one or more keys
1. Concatenation with pd.concat()
The pd.concat() function merges objects along a chosen axis. Its commonly used parameters are:
pd.concat(objs,axis=0,join='outer')
- objs: a sequence (e.g., a list) of Series or DataFrame objects
- axis: the axis to concatenate along; 0 for rows, 1 for columns; defaults to 0
- join: how to handle the other axis, either 'inner' or 'outer'; defaults to 'outer'
    import pandas as pd
    
    dict_1 = {
             'A':['A0','A1','A2','A3'],
             'B':['B0','B1','B2','B3'],
             'C':['C0','C1','C2','C3']}
    df1 = pd.DataFrame(dict_1)
    df1
|  A  |  B  |  C
---|---|---|---
0  |  A0  |  B0  |  C0
1  |  A1  |  B1  |  C1
2  |  A2  |  B2  |  C2
3  |  A3  |  B3  |  C3
    dict_2 = {
             'B':['B0','B1','B2','B3'],
             'C':['C0','C1','C2','C3'],
             'D':['D0','D1','D2','D3']}
    df2 = pd.DataFrame(dict_2)
    df2
|  B  |  C  |  D
---|---|---|---
0  |  B0  |  C0  |  D0
1  |  B1  |  C1  |  D1
2  |  B2  |  C2  |  D2
3  |  B3  |  C3  |  D3
Merging df1 and df2 with pd.concat() and its default parameters:
    df3 = pd.concat([df1,df2], axis=0, join='outer')
    df3
|  A  |  B  |  C  |  D
---|---|---|---|---
0  |  A0  |  B0  |  C0  |  NaN
1  |  A1  |  B1  |  C1  |  NaN
2  |  A2  |  B2  |  C2  |  NaN
3  |  A3  |  B3  |  C3  |  NaN
0  |  NaN  |  B0  |  C0  |  D0
1  |  NaN  |  B1  |  C1  |  D1
2  |  NaN  |  B2  |  C2  |  D2
3  |  NaN  |  B3  |  C3  |  D3
The result shows that with join='outer' and axis=0, the columns are unioned and the tables are stacked vertically; missing values are filled with NaN, and the original row indexes are preserved.
If neither table's index carries real meaning, set the ignore_index parameter to True to rebuild a fresh index:
    df4 = pd.concat([df1,df2],axis=0,join='outer',ignore_index=True)
    df4
|  A  |  B  |  C  |  D
---|---|---|---|---
0  |  A0  |  B0  |  C0  |  NaN
1  |  A1  |  B1  |  C1  |  NaN
2  |  A2  |  B2  |  C2  |  NaN
3  |  A3  |  B3  |  C3  |  NaN
4  |  NaN  |  B0  |  C0  |  D0
5  |  NaN  |  B1  |  C1  |  D1
6  |  NaN  |  B2  |  C2  |  D2
7  |  NaN  |  B3  |  C3  |  D3
Setting the axis parameter of concat() to 1 and merging df1 and df2:
    df5 = pd.concat([df1,df2],axis=1,join='outer')
    df5
|  A  |  B  |  C  |  B  |  C  |  D
---|---|---|---|---|---|---
0  |  A0  |  B0  |  C0  |  B0  |  C0  |  D0
1  |  A1  |  B1  |  C1  |  B1  |  C1  |  D1
2  |  A2  |  B2  |  C2  |  B2  |  C2  |  D2
3  |  A3  |  B3  |  C3  |  B3  |  C3  |  D3
Setting the join parameter of concat() to 'inner' and merging df1 and df2:
    df6 = pd.concat([df1,df2],axis=0,join='inner')
    df6
|  B  |  C
---|---|---
0  |  B0  |  C0
1  |  B1  |  C1
2  |  B2  |  C2
3  |  B3  |  C3
0  |  B0  |  C0
1  |  B1  |  C1
2  |  B2  |  C2
3  |  B3  |  C3
The results above show that:
- with join='inner', you get the intersection of the two tables' columns;
- with join='outer', you get the union of their columns.
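The same intersection/union rule applies to the row index when axis=1. A minimal sketch (the two small frames below are made up for illustration):

    import pandas as pd
    
    # Two frames whose row indexes only partly overlap
    left = pd.DataFrame({'X': [1, 2, 3]}, index=[0, 1, 2])
    right = pd.DataFrame({'Y': [4, 5, 6]}, index=[1, 2, 3])
    
    # outer keeps the union of row labels 0-3, filling gaps with NaN
    print(pd.concat([left, right], axis=1, join='outer'))
    # inner keeps only the shared row labels 1 and 2
    print(pd.concat([left, right], axis=1, join='inner'))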
2. Merging with pd.merge()
The commonly used parameters of pd.merge() are:
pd.merge(left,right,how='inner',on=None)
Parameters:
- left and right: the two DataFrames to merge
- how: the join type; one of inner, left, right, outer; defaults to inner
- on: the name of the column(s) to join on, which must exist in both DataFrames; if it is not specified and no other key parameter is given, the intersection of the two DataFrames' column names is used as the join key
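As an aside (not used in the examples below), when the key columns are named differently in the two tables, merge() also accepts left_on and right_on in place of on. A minimal sketch with made-up frames:

    import pandas as pd
    
    orders = pd.DataFrame({'cust_id': ['a', 'b'], 'amount': [10, 20]})
    customers = pd.DataFrame({'id': ['a', 'b'], 'name': ['Ann', 'Bob']})
    
    # Join on key columns that have different names in the two frames
    print(pd.merge(orders, customers, left_on='cust_id', right_on='id'))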
    import pandas as pd
    left = pd.DataFrame({'key':['a','b','b','d'],'data1':range(4)})
    left
|  key  |  data1  
---|---|---  
0  |  a  |  0  
1  |  b  |  1  
2  |  b  |  2  
3  |  d  |  3
    right = pd.DataFrame({'key':['a','b','c'],'data2':range(3)})
    right
|  key  |  data2
---|---|---
0  |  a  |  0
1  |  b  |  1
2  |  c  |  2
    df7 = pd.merge(left,right)
    df7
|  key  |  data1  |  data2
---|---|---|---
0  |  a  |  0  |  0
1  |  b  |  1  |  1
2  |  b  |  2  |  1
merge() defaults to an inner join and uses the intersection of the two DataFrames' column names (here, key) as the join key; accordingly, the rows in the result cover only the intersection of the two DataFrames' key values.
When the two DataFrames are joined with how='outer':
    df8= pd.merge(left,right,on=['key'],how='outer')
    df8
|  key  |  data1  |  data2
---|---|---|---
0  |  a  |  0.0  |  0.0
1  |  b  |  1.0  |  1.0
2  |  b  |  2.0  |  1.0
3  |  d  |  3.0  |  NaN
4  |  c  |  NaN  |  2.0
With an outer join, merge() returns the union of the two DataFrames' key values, and missing entries are filled with NaN.
When the two DataFrames are joined with how='left':
    df9 = pd.merge(left,right,on=['key'],how='left')
    df9
|  key  |  data1  |  data2
---|---|---|---
0  |  a  |  0  |  0.0
1  |  b  |  1  |  1.0
2  |  b  |  2  |  1.0
3  |  d  |  3  |  NaN
With a left join, merge() takes the left table's join keys as the baseline and merges in the columns of both tables; missing entries are filled with NaN.
So what happens when the two DataFrames are joined with how='right'?
    df10= pd.merge(left,right,on=['key'],how='right')
    df10
|  key  |  data1  |  data2
---|---|---|---
0  |  a  |  0.0  |  0
1  |  b  |  1.0  |  1
2  |  b  |  2.0  |  1
3  |  c  |  NaN  |  2
Typical use cases for the two kinds of merging (see the sketch after this list):
- Two tables hold the transaction records for September and October respectively; concat() can stack them along axis 0
- One table holds order information (order number, amount, customer ID, etc.) and a second holds customer information (customer ID, name, phone number, etc.); merge() can join the two on customer ID into one complete table
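A minimal sketch of both scenarios; every table and column name below is made up for illustration:

    import pandas as pd
    
    # Scenario 1: stack two months of transactions vertically
    sep_orders = pd.DataFrame({'order_id': [1, 2], 'amount': [100, 200]})
    oct_orders = pd.DataFrame({'order_id': [3, 4], 'amount': [150, 250]})
    transactions = pd.concat([sep_orders, oct_orders], ignore_index=True)
    
    # Scenario 2: join orders to customer details on the shared customer ID
    orders = pd.DataFrame({'order_id': [1, 2], 'cust_id': ['a', 'b']})
    customers = pd.DataFrame({'cust_id': ['a', 'b'], 'name': ['Ann', 'Bob']})
    full = pd.merge(orders, customers, on='cust_id')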
II. Data Filtering
In the previous sections we learned how to retrieve a single row or several consecutive rows, but in real work we routinely handle tens of thousands of rows or more.
So how can we quickly filter out the rows that meet a given condition?
    import pandas as pd
    
    # Load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # Check the basic shape of the data
    print(df.shape)
    df.head()
    (72756, 7)
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
0  |  587598f89f11daf90617fb7a  |  52  |  17  |  1  |  30  |  58  |  2
1  |  587598f89f11daf90617fb7c  |  27  |  73  |  15  |  87  |  26  |  1
2  |  587598f89f11daf90617fb7e  |  72  |  94  |  1  |  112  |  20  |  4
3  |  587598f89f11daf90617fb80  |  174  |  84  |  8  |  895  |  30  |  7
4  |  587598f89f11daf90617fb82  |  3  |  236  |  64  |  119  |  44  |  17
1. Extract the users with more than 100 followers (关注者)
    bools = df['关注者']>100
    bools
    0        False
    1        False
    2        False
    3        False
    4        False
             ...  
    72751    False
    72752    False
    72753    False
    72754    False
    72755    False
    Name: 关注者, Length: 72756, dtype: bool
    df1 = df[bools]
    print(df1.shape)
    df1
    (5501, 7)
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
79  |  587598f89f11daf90617fc18  |  8  |  111  |  3049  |  1216  |  45  |  17
101  |  587598f89f11daf90617fc44  |  61  |  123  |  205  |  670  |  24  |  18
112  |  587598f99f11daf90617fc5a  |  1  |  804  |  197  |  830  |  39  |  26
114  |  587598f99f11daf90617fc5e  |  44  |  62  |  201  |  314  |  8  |  3
121  |  587598f99f11daf90617fc6c  |  44  |  628  |  6597  |  1566  |  144  |  69
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72694  |  5878399e9f11da09c8fe690e  |  3  |  172  |  219  |  466  |  68  |  11
72700  |  5878399e9f11da09c8fe691a  |  21  |  39  |  281  |  51  |  45  |  2
72713  |  5878399e9f11da09c8fe6934  |  49  |  635  |  248  |  1772  |  151  |  123
72726  |  5878399e9f11da09c8fe694e  |  447  |  62  |  728  |  14  |  1  |  21
72741  |  5878399e9f11da09c8fe696c  |  3546  |  390  |  142  |  25  |  4  |  126
5501 rows × 7 columns
Let's walk back through the logic of the code above:
- First, check for each user whether the follower count is greater than 100: rows where it is are marked True, the rest False. bools records whether each row meets the filter condition; it is a Series whose values are of type bool.
    bools = df['关注者']>100
- Then filter df by the value of each row in bools: rows whose value is True are kept, the rest are dropped
    df1 = df[bools]
- The df1 printed at the end is exactly the data of the users with more than 100 followers. This is the basic logic of filtering in pandas by a column's values
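In practice the two steps are usually collapsed into a single line; the same filter, written inline:

    # Same filter without the intermediate bools variable
    df1 = df[df['关注者'] > 100]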
That is how filtering on a single column's values works. But what if we need to filter on several conditions at once?
2. Get the users with more than 300 followers (关注者) who also follow (关注) more than 100 others
    import pandas as pd
    # Load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # Check the basic shape of the data
    df
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
0  |  587598f89f11daf90617fb7a  |  52  |  17  |  1  |  30  |  58  |  2
1  |  587598f89f11daf90617fb7c  |  27  |  73  |  15  |  87  |  26  |  1
2  |  587598f89f11daf90617fb7e  |  72  |  94  |  1  |  112  |  20  |  4
3  |  587598f89f11daf90617fb80  |  174  |  84  |  8  |  895  |  30  |  7
4  |  587598f89f11daf90617fb82  |  3  |  236  |  64  |  119  |  44  |  17
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72751  |  5878399e9f11da09c8fe6980  |  112  |  77  |  27  |  39  |  2  |  32
72752  |  5878399e9f11da09c8fe6982  |  15  |  90  |  16  |  61  |  116  |  2
72753  |  5878399e9f11da09c8fe6984  |  7  |  19  |  1  |  25  |  12  |  2
72754  |  5878399e9f11da09c8fe6986  |  14  |  320  |  3  |  68  |  38  |  9
72755  |  5878399e9f11da09c8fe6988  |  7  |  1  |  2  |  7  |  15  |  0
72756 rows × 7 columns
    bool_1 = df['关注者']>300
    bool_2 = df['关注']>100
    df2 = df[bool_1 & bool_2]
    df2
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
79  |  587598f89f11daf90617fc18  |  8  |  111  |  3049  |  1216  |  45  |  17
121  |  587598f99f11daf90617fc6c  |  44  |  628  |  6597  |  1566  |  144  |  69
155  |  587598f99f11daf90617fcb0  |  194  |  250  |  1103  |  10  |  1  |  19
228  |  587598f99f11daf90617fd42  |  48  |  224  |  1450  |  360  |  128  |  20
261  |  587598f99f11daf90617fd84  |  1125  |  842  |  24848  |  108  |  33  |  27
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72173  |  5878399b9f11da09c8fe64fc  |  38  |  5000  |  1386  |  2205  |  25  |  6
72416  |  5878399d9f11da09c8fe66e2  |  661  |  299  |  457  |  32  |  4  |  9
72463  |  5878399d9f11da09c8fe6740  |  274  |  209  |  1556  |  2405  |  188  |  318
72648  |  5878399e9f11da09c8fe68b2  |  10  |  278  |  471  |  763  |  66  |  12
72661  |  5878399e9f11da09c8fe68cc  |  756  |  491  |  414  |  229  |  2  |  58
1921 rows × 7 columns
Walking back through the code:
- The two conditions df['关注者']>300 and df['关注']>100 produce the two boolean Series bool_1 and bool_2
- Because both conditions must hold at the same time, the two Series are combined with the logical & operator
- Finally, only the rows that satisfy both conditions are returned
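The conditions can also be written inline, but each one must then be wrapped in parentheses, because & binds more tightly than the comparison operators:

    # Parentheses around each condition are required
    df2 = df[(df['关注者'] > 300) & (df['关注'] > 100)]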
3. A second way to filter data
The loc[] indexer achieves the same filtering effect.
    import pandas as pd
    
    # Load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # Check the basic shape of the data
    print(df.shape)
    df.head()
    (72756, 7)
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
0  |  587598f89f11daf90617fb7a  |  52  |  17  |  1  |  30  |  58  |  2
1  |  587598f89f11daf90617fb7c  |  27  |  73  |  15  |  87  |  26  |  1
2  |  587598f89f11daf90617fb7e  |  72  |  94  |  1  |  112  |  20  |  4
3  |  587598f89f11daf90617fb80  |  174  |  84  |  8  |  895  |  30  |  7
4  |  587598f89f11daf90617fb82  |  3  |  236  |  64  |  119  |  44  |  17
    # Get the users with more than 100 followers
    df_new = df.loc[df['关注者']>100]
    df_new
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
79  |  587598f89f11daf90617fc18  |  8  |  111  |  3049  |  1216  |  45  |  17
101  |  587598f89f11daf90617fc44  |  61  |  123  |  205  |  670  |  24  |  18
112  |  587598f99f11daf90617fc5a  |  1  |  804  |  197  |  830  |  39  |  26
114  |  587598f99f11daf90617fc5e  |  44  |  62  |  201  |  314  |  8  |  3
121  |  587598f99f11daf90617fc6c  |  44  |  628  |  6597  |  1566  |  144  |  69
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72694  |  5878399e9f11da09c8fe690e  |  3  |  172  |  219  |  466  |  68  |  11
72700  |  5878399e9f11da09c8fe691a  |  21  |  39  |  281  |  51  |  45  |  2
72713  |  5878399e9f11da09c8fe6934  |  49  |  635  |  248  |  1772  |  151  |  123
72726  |  5878399e9f11da09c8fe694e  |  447  |  62  |  728  |  14  |  1  |  21
72741  |  5878399e9f11da09c8fe696c  |  3546  |  390  |  142  |  25  |  4  |  126
5501 rows × 7 columns
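As an aside, one advantage of loc[] over plain bracket indexing is that it can filter rows and select columns in a single step:

    # Keep only the _id and 关注者 columns of the matching rows
    df.loc[df['关注者'] > 100, ['_id', '关注者']]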
III. Data Sorting
Sorting also comes up constantly when working with data. For example, we may need to find the information of the 10 users with the most followers.

Data can be sorted with the sort_index() and sort_values() methods, and both are supported on Series as well as DataFrame.
So how do these two methods differ on a DataFrame?
- sort_index() sorts by the index (the row labels)
- sort_values() sorts by the values of one or more specified columns
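A minimal sketch of the difference, on a made-up frame:

    import pandas as pd
    
    df = pd.DataFrame({'score': [3, 1, 2]}, index=['c', 'a', 'b'])
    print(df.sort_index())             # rows ordered a, b, c by the index
    print(df.sort_values(by='score'))  # rows ordered 1, 2, 3 by the score column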
1. Sort the countries by country code (the row index)
    import pandas as pd
    
    people = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx',index_col='Country Code')
    people
|  CountryName  |  1990  |  2000  |  2007  |  2008  |  2009  |  2010  |  2011  |  2012  |  2013  |  2014  |  2015  |  Change 1990-2015  |  Change 2007-2015
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Country Code  |  |  |  |  |  |  |  |  |  |  |  |  |  |
AFG  |  Afghanistan  |  101.094930  |  103.254202  |  100.000371  |  100.215886  |  100.060480  |  99.459839  |  97.667911  |  95.312707  |  92.602785  |  89.773777  |  86.954464  |  -14.140466  |  -13.045907
ALB  |  Albania  |  61.808311  |  59.585866  |  50.862987  |  49.663787  |  48.637067  |  NaN  |  46.720288  |  45.835739  |  45.247477  |  44.912168  |  44.806973  |  -17.001338  |  -6.056014
DZA  |  Algeria  |  87.675705  |  62.886169  |  49.487870  |  48.910002  |  48.645026  |  48.681853  |  49.233576  |  49.847713  |  50.600697  |  51.536631  |  52.617579  |  -35.058127  |  3.129709
ASM  |  American Samoa  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN
ADO  |  Andorra  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
VIR  |  Virgin Islands (U.S.)  |  53.546901  |  52.647183  |  49.912779  |  50.459425  |  51.257336  |  52.382523  |  53.953515  |  55.687666  |  57.537152  |  59.399244  |  61.199651  |  7.652751  |  11.286872
WBG  |  West Bank and Gaza  |  102.789182  |  100.410470  |  88.367099  |  86.204936  |  84.173732  |  82.333762  |  80.851747  |  79.426778  |  78.118403  |  76.975462  |  76.001869  |  -26.787313  |  -12.365230
YEM  |  Yemen, Rep.  |  118.779727  |  105.735754  |  88.438350  |  85.899643  |  83.594489  |  81.613924  |  80.193948  |  78.902603  |  77.734373  |  76.644268  |  75.595147  |  -43.184581  |  -12.843203
ZMB  |  Zambia  |  100.485263  |  97.124776  |  98.223058  |  98.277253  |  98.260795  |  98.148325  |  97.854237  |  97.385802  |  96.791310  |  96.122165  |  95.402326  |  -5.082936  |  -2.820732
ZWE  |  Zimbabwe  |  96.365244  |  84.877487  |  81.272166  |  81.024020  |  80.934968  |  80.985702  |  80.740494  |  80.579870  |  80.499816  |  80.456439  |  80.391033  |  -15.974211  |  -0.881133
219 rows × 14 columns
    people.sort_index(inplace=True,ascending=True)
    people.head()
|  CountryName  |  1990  |  2000  |  2007  |  2008  |  2009  |  2010  |  2011  |  2012  |  2013  |  2014  |  2015  |  Change 1990-2015  |  Change 2007-2015
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Country Code  |  |  |  |  |  |  |  |  |  |  |  |  |  |
ABW  |  Aruba  |  47.500831  |  44.420778  |  43.475981  |  43.957839  |  44.531061  |  45.157235  |  44.993173  |  44.706680  |  44.355302  |  44.070252  |  43.952390  |  -3.548441  |  0.476409
ADO  |  Andorra  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN  |  NaN
AFG  |  Afghanistan  |  101.094930  |  103.254202  |  100.000371  |  100.215886  |  100.060480  |  99.459839  |  97.667911  |  95.312707  |  92.602785  |  89.773777  |  86.954464  |  -14.140466  |  -13.045907
AGO  |  Angola  |  101.394722  |  100.930475  |  102.563811  |  102.609186  |  102.428788  |  102.035690  |  102.106756  |  101.836900  |  101.315234  |  100.637667  |  99.855751  |  -1.538971  |  -2.708060
ALB  |  Albania  |  61.808311  |  59.585866  |  50.862987  |  49.663787  |  48.637067  |  NaN  |  46.720288  |  45.835739  |  45.247477  |  44.912168  |  44.806973  |  -17.001338  |  -6.056014
Walking back through the code:
- The index_col='Country Code' argument to read_excel() tells pandas to use the Country Code column as the row index when reading the file
- inplace=True plays the same role we have seen before: it controls whether the original data is modified in place
- ascending controls the sort order: the default True sorts ascending (small to large), and setting it to False sorts descending
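For instance, reversing the same ordering is a one-liner; since inplace defaults to False, this returns a new frame and leaves people untouched:

    # Descending (Z-to-A) sort on the Country Code index
    people.sort_index(ascending=False).head()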
2. Get the data of the 10 users with the most followers
For this we need to sort by the 关注者 (followers) column.
    import pandas as pd
    
    user = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    user.sort_values(by='关注者',ascending=False,inplace=True)
    user.head(10)
|  _id  |  关注的收藏夹  |  关注  |  关注者  |  关注的问题  |  关注的话题  |  关注的专栏
---|---|---|---|---|---|---|---
23313  |  5876ee159f11da03c9726155  |  3181  |  1053  |  585787  |  93  |  33  |  26
14461  |  5876ede89f11da03c9721c33  |  1041  |  254  |  356246  |  92  |  3  |  23
22048  |  5876ee0c9f11da03c9725773  |  41  |  72  |  351414  |  336  |  47  |  16
52464  |  5876ece59f11da03afbc71b3  |  1652  |  794  |  308553  |  23  |  1  |  112
42135  |  5876eb5a9f11da0399d461b7  |  1182  |  532  |  268353  |  23  |  1  |  54
63579  |  5878396e9f11da09c8fe21e6  |  10819  |  1412  |  198199  |  90  |  1  |  517
21740  |  5876ee0b9f11da03c972550b  |  991  |  479  |  195175  |  46  |  3  |  18
56721  |  5876ecfa9f11da03afbc92f1  |  164  |  550  |  164254  |  0  |  1  |  19
22784  |  5876ee129f11da03c9725d33  |  2011  |  579  |  148457  |  49  |  6  |  44
29577  |  587838f59f11da09bd427945  |  8  |  303  |  137087  |  477  |  19  |  15
Walking back through the code:
- by: selects the column to sort on; assign it the name of the column you want to sort by
- ascending=False: sorts the data from largest to smallest
- inplace=True: controls whether the original data is modified in place
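As an aside, for a plain top-N query pandas also offers nlargest(), which returns the same ten rows in one call without modifying user:

    # Top 10 users by follower count, original frame left untouched
    user.nlargest(10, '关注者')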
IV. Summary
- Data merging: pd.concat() stacks tables along an axis, while pd.merge() joins tables on key columns
- Data filtering: boolean Series built from conditions, used directly or through loc[]
- Data sorting: sort_index() for the index and sort_values() for column values

V. Exercises
- We have two datasets on hand. Merge the two CSV files into one, sort by ride duration in descending order, and find the longest ride time.
    import pandas as pd
    # Read the first dataset
    bike1 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_1_data.csv')
    bike1
|  Duration (ms)  |  Start date  |  End date  |  Start station  |  End station  |  Bike number  |  Member type
---|---|---|---|---|---|---|---
0  |  221834  |  2017/1/1 0:00  |  2017/1/1 0:04  |  3rd & Tingey St SE  |  M St & New Jersey Ave SE  |  W00869  |  Member
1  |  1676854  |  2017/1/1 0:06  |  2017/1/1 0:34  |  Lincoln Memorial  |  8th & D St NW  |  W00894  |  Casual
2  |  1356956  |  2017/1/1 0:07  |  2017/1/1 0:29  |  Henry Bacon Dr & Lincoln Memorial Circle NW  |  New York Ave & 15th St NW  |  W21945  |  Casual
3  |  1327901  |  2017/1/1 0:07  |  2017/1/1 0:29  |  Henry Bacon Dr & Lincoln Memorial Circle NW  |  New York Ave & 15th St NW  |  W20012  |  Casual
4  |  1636768  |  2017/1/1 0:07  |  2017/1/1 0:34  |  Lincoln Memorial  |  8th & D St NW  |  W22786  |  Casual
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
42400  |  262484  |  2017/1/10 23:52  |  2017/1/10 23:56  |  Georgia & New Hampshire Ave NW  |  11th & Kenyon St NW  |  W01445  |  Member
42401  |  630296  |  2017/1/10 23:52  |  2017/1/11 0:03  |  17th & K St NW / Farragut Square  |  21st & M St NW  |  W00097  |  Casual
42402  |  114584  |  2017/1/10 23:53  |  2017/1/10 23:55  |  Montgomery & East Ln  |  Bethesda Ave & Arlington Rd  |  W20637  |  Member
42403  |  564322  |  2017/1/10 23:53  |  2017/1/11 0:03  |  17th & K St NW / Farragut Square  |  21st & M St NW  |  W01377  |  Casual
42404  |  809980  |  2017/1/10 23:59  |  2017/1/11 0:13  |  4th St & Madison Dr NW  |  North Capitol St & F St NW  |  W22436  |  Member
42405 rows × 7 columns
    # Read the second dataset
    bike2 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_2_data.csv')
    bike2
|  Duration (ms)  |  Start date  |  End date  |  Start station  |  End station  |  Bike number  |  Member type
---|---|---|---|---|---|---|---
0  |  636921  |  2017/2/1 0:00  |  2017/2/1 0:11  |  Park Rd & Holmead Pl NW  |  14th & Upshur St NW  |  W01023  |  Casual
1  |  399528  |  2017/2/1 0:01  |  2017/2/1 0:08  |  M St & Pennsylvania Ave NW  |  21st & M St NW  |  W20687  |  Casual
2  |  186311  |  2017/2/1 0:05  |  2017/2/1 0:08  |  Massachusetts Ave & Dupont Circle NW  |  17th & Corcoran St NW  |  W21637  |  Member
3  |  746740  |  2017/2/1 0:05  |  2017/2/1 0:17  |  11th & M St NW  |  16th & Harvard St NW  |  W22297  |  Member
4  |  508111  |  2017/2/1 0:05  |  2017/2/1 0:14  |  14th & R St NW  |  18th St & Wyoming Ave NW  |  W22024  |  Member
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72875  |  667119  |  2017/2/10 23:57  |  2017/2/11 0:08  |  24th & N St NW  |  15th & L St NW  |  W21336  |  Member
72876  |  569178  |  2017/2/10 23:58  |  2017/2/11 0:08  |  7th & T St NW  |  18th & R St NW  |  W01405  |  Member
72877  |  308804  |  2017/2/10 23:59  |  2017/2/11 0:04  |  18th & M St NW  |  New Hampshire Ave & T St NW  |  W22673  |  Member
72878  |  1087725  |  2017/2/10 23:59  |  2017/2/11 0:17  |  7th & F St NW / National Portrait Gallery  |  19th & East Capitol St SE  |  W00254  |  Member
72879  |  572705  |  2017/2/10 23:59  |  2017/2/11 0:09  |  11th & H St NE  |  4th & East Capitol St NE  |  W21512  |  Member
72880 rows × 7 columns
    # Merge the data
    # All parameters here are the defaults, so pd.concat([bike1,bike2]) would work just as well
    bike3 = pd.concat([bike1,bike2],axis=0,join='outer')
    bike3
|  Duration (ms)  |  Start date  |  End date  |  Start station  |  End station  |  Bike number  |  Member type
---|---|---|---|---|---|---|---
0  |  221834  |  2017/1/1 0:00  |  2017/1/1 0:04  |  3rd & Tingey St SE  |  M St & New Jersey Ave SE  |  W00869  |  Member
1  |  1676854  |  2017/1/1 0:06  |  2017/1/1 0:34  |  Lincoln Memorial  |  8th & D St NW  |  W00894  |  Casual
2  |  1356956  |  2017/1/1 0:07  |  2017/1/1 0:29  |  Henry Bacon Dr & Lincoln Memorial Circle NW  |  New York Ave & 15th St NW  |  W21945  |  Casual
3  |  1327901  |  2017/1/1 0:07  |  2017/1/1 0:29  |  Henry Bacon Dr & Lincoln Memorial Circle NW  |  New York Ave & 15th St NW  |  W20012  |  Casual
4  |  1636768  |  2017/1/1 0:07  |  2017/1/1 0:34  |  Lincoln Memorial  |  8th & D St NW  |  W22786  |  Casual
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
72875  |  667119  |  2017/2/10 23:57  |  2017/2/11 0:08  |  24th & N St NW  |  15th & L St NW  |  W21336  |  Member
72876  |  569178  |  2017/2/10 23:58  |  2017/2/11 0:08  |  7th & T St NW  |  18th & R St NW  |  W01405  |  Member
72877  |  308804  |  2017/2/10 23:59  |  2017/2/11 0:04  |  18th & M St NW  |  New Hampshire Ave & T St NW  |  W22673  |  Member
72878  |  1087725  |  2017/2/10 23:59  |  2017/2/11 0:17  |  7th & F St NW / National Portrait Gallery  |  19th & East Capitol St SE  |  W00254  |  Member
72879  |  572705  |  2017/2/10 23:59  |  2017/2/11 0:09  |  11th & H St NE  |  4th & East Capitol St NE  |  W21512  |  Member
115285 rows × 7 columns
    # Sort ride durations in descending order
    bike3.sort_values(by='Duration (ms)',ascending=False,inplace=True)
    bike3
|  Duration (ms)  |  Start date  |  End date  |  Start station  |  End station  |  Bike number  |  Member type
---|---|---|---|---|---|---|---
24905  |  84876226  |  2017/1/6 10:59  |  2017/1/7 10:33  |  17th & K St NW / Farragut Square  |  19th & K St NW  |  W23232  |  Member
44077  |  84070936  |  2017/2/7 10:44  |  2017/2/8 10:05  |  Thomas Circle  |  4th & C St SW  |  W21615  |  Casual
35857  |  83789379  |  2017/2/6 12:35  |  2017/2/7 11:51  |  Fleet St & Ritchie Pkwy  |  E Montgomery Ave & Maryland Ave  |  W22565  |  Casual
35874  |  83662979  |  2017/2/6 12:38  |  2017/2/7 11:53  |  Fleet St & Ritchie Pkwy  |  E Montgomery Ave & Maryland Ave  |  W21846  |  Casual
29204  |  82897521  |  2017/1/7 11:42  |  2017/1/8 10:44  |  8th & H St NW  |  4th St & Madison Dr NW  |  W21095  |  Casual
...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |  ...
22670  |  60614  |  2017/2/4 8:29  |  2017/2/4 8:30  |  18th & R St NW  |  18th & R St NW  |  W22795  |  Member
48831  |  60606  |  2017/2/7 18:31  |  2017/2/7 18:32  |  Kennedy Center  |  Kennedy Center  |  W22223  |  Casual
60148  |  60504  |  2017/2/8 19:14  |  2017/2/8 19:15  |  Columbus Circle / Union Station  |  Columbus Circle / Union Station  |  W21323  |  Casual
37585  |  60459  |  2017/2/6 16:53  |  2017/2/6 16:54  |  Columbus Circle / Union Station  |  Columbus Circle / Union Station  |  W21710  |  Member
48652  |  60452  |  2017/2/7 18:21  |  2017/2/7 18:22  |  Towers Crescent Dr & Tysons One Pl  |  Towers Crescent Dr & Tysons One Pl  |  W23040  |  Member
115285 rows × 7 columns
    # Get the longest ride time
    long_time = bike3.iloc[0]['Duration (ms)']
    long_time
    84876226
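Because bike3 is already sorted, iloc[0] gives the answer; as an aside, the maximum could also be read directly without sorting at all:

    # Equivalent result without the sort
    bike3['Duration (ms)'].max()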
- Using the resident population data from the Sixth National Population Census, find the 10 most populous cities in China. The data contains four fields: 省 (province), 地区 (region), 结尾 (name suffix), and 常住人口 (resident population).
- Series.str treats each value in the column as a string
- contains() tests whether each string contains the given text and returns values of type bool
    import pandas as pd
    # Read the data
    people = pd.read_csv(r'C:\Users\lin-a\Desktop\data\liupu.csv')
    
    # Filter the rows whose 结尾 field contains '市'
    bools = people['结尾'].str.contains('市')
    people_2 = people[bools]
    
    # Sort by resident population in descending order
    people_3 = people_2.sort_values(by='常住人口',ascending=False,inplace=False)
    # Take the top 10 rows
    people_3.head(10)
|  省  |  地区  |  结尾  |  常住人口
---|---|---|---|---
567  |  上海市  |  上海市  |  市  |  23019196.0
23  |  北京市  |  北京市  |  市  |  19612368.0
569  |  四川省  |  成都市  |  市  |  14047625.0
601  |  天津市  |  天津市  |  市  |  12938693.0
73  |  广东省  |  广州市  |  市  |  12701948.0
157  |  河北省  |  保定市  |  市  |  11194382.0
232  |  黑龙江省  |  哈尔滨市  |  市  |  10635971.0
378  |  江苏省  |  苏州市  |  市  |  10459890.0
95  |  广东省  |  深圳市  |  市  |  10358381.0
204  |  河南省  |  南阳市  |  市  |  10263660.0
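As an aside, since '市' marks the end of a city name here, str.endswith('市') would express the same filter slightly more precisely than contains():

    # Equivalent filter that matches only a trailing '市'
    bools = people['结尾'].str.endswith('市')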