I. Data Merging

Data merging mainly involves two operations:

  • Concatenation: pd.concat() joins multiple DataFrame objects along an axis, producing a new DataFrame
  • Merging: pd.merge() joins the rows of different DataFrames on one or more keys

1. Concatenation with pd.concat()

The pd.concat() function merges objects along a chosen axis. Its commonly used parameters are:

pd.concat(objs,axis=0,join='outer')

  • objs: a sequence (list) of Series, DataFrame, or Panel objects (Panel has been removed from recent pandas versions)
  • axis: the axis to concatenate along, 0 for rows and 1 for columns; defaults to 0
  • join: the join method, 'inner' or 'outer'; defaults to 'outer'
    import pandas as pd
    
    dict_1 = {
             'A':['A0','A1','A2','A3'],
             'B':['B0','B1','B2','B3'],
             'C':['C0','C1','C2','C3']}
    df1 = pd.DataFrame(dict_1)
    df1

| A | B | C
---|---|---|---
0 | A0 | B0 | C0
1 | A1 | B1 | C1
2 | A2 | B2 | C2
3 | A3 | B3 | C3

    dict_2 = {
             'B':['B0','B1','B2','B3'],
             'C':['C0','C1','C2','C3'],
             'D':['D0','D1','D2','D3']}
    df2 = pd.DataFrame(dict_2)
    df2

| B | C | D
---|---|---|---
0 | B0 | C0 | D0
1 | B1 | C1 | D1
2 | B2 | C2 | D2
3 | B3 | C3 | D3

Merging df1 and df2 with pd.concat()'s default parameters:

    df3 = pd.concat([df1,df2], axis=0, join='outer')
    df3

| A | B | C | D
---|---|---|---|---
0 | A0 | B0 | C0 | NaN
1 | A1 | B1 | C1 | NaN
2 | A2 | B2 | C2 | NaN
3 | A3 | B3 | C3 | NaN
0 | NaN | B0 | C0 | D0
1 | NaN | B1 | C1 | D1
2 | NaN | B2 | C2 | D2
3 | NaN | B3 | C3 | D3

These results show that when join='outer' and axis=0, the columns are unioned and the tables are stacked vertically; missing values are filled with NaN, and the original row indexes are preserved.

If neither table's index carries real meaning, set the ignore_index parameter to True to rebuild a fresh index.

    df4 = pd.concat([df1,df2],axis=0,join='outer',ignore_index=True)
    df4

| A | B | C | D
---|---|---|---|---
0 | A0 | B0 | C0 | NaN
1 | A1 | B1 | C1 | NaN
2 | A2 | B2 | C2 | NaN
3 | A3 | B3 | C3 | NaN
4 | NaN | B0 | C0 | D0
5 | NaN | B1 | C1 | D1
6 | NaN | B2 | C2 | D2
7 | NaN | B3 | C3 | D3

When we set concat()'s axis parameter to 1 and merge df1 and df2:

    df5 = pd.concat([df1,df2],axis=1,join='outer')
    df5

| A | B | C | B | C | D
---|---|---|---|---|---|---
0 | A0 | B0 | C0 | B0 | C0 | D0
1 | A1 | B1 | C1 | B1 | C1 | D1
2 | A2 | B2 | C2 | B2 | C2 | D2
3 | A3 | B3 | C3 | B3 | C3 | D3

When we set concat()'s join parameter to 'inner' and merge df1 and df2:

    df6 = pd.concat([df1,df2],axis=0,join='inner')
    df6

| B | C
---|---|---
0 | B0 | C0
1 | B1 | C1
2 | B2 | C2
3 | B3 | C3
0 | B0 | C0
1 | B1 | C1
2 | B2 | C2
3 | B3 | C3

From the results above:

  • if the join parameter is 'inner', the result is the intersection of the two tables;
  • if the join parameter is 'outer', the result is the union of the two tables.
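
Note that the join parameter acts on the axis that is *not* being concatenated. A minimal sketch, reusing df1 and df2 from above, that combines axis=1 with join='inner':

    # With axis=1, join='inner' intersects the row indexes instead of the columns.
    # df1 and df2 share row labels 0-3, so every row survives, and the duplicated
    # B and C columns appear twice, once from each frame.
    pd.concat([df1, df2], axis=1, join='inner')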

2. Merging with pd.merge()

The commonly used parameters of pd.merge() are:

pd.merge(left,right,how='inner',on=None)

Parameters:

  • left and right: the two DataFrames to merge
  • how: the join type, one of inner, left, right, or outer; defaults to inner
  • on: the column name(s) to join on, which must exist in both DataFrames; if on is not given and no other key parameters are specified, the intersection of the two DataFrames' column names is used as the join key
    import pandas as pd
    left = pd.DataFrame({'key':['a','b','b','d'],'data1':range(4)})
    left

| key | data1
---|---|---
0 | a | 0
1 | b | 1
2 | b | 2
3 | d | 3

    right = pd.DataFrame({'key':['a','b','c'],'data2':range(3)})
    right

| key | data2
---|---|---
0 | a | 0
1 | b | 1
2 | c | 2

    df7 = pd.merge(left,right)
    df7

| key | data1 | data2
---|---|---|---
0 | a | 0 | 0
1 | b | 1 | 1
2 | b | 2 | 1

merge() defaults to an inner join and uses the intersection of the two DataFrames' column names (here, key) as the join key; accordingly, the joined result also contains only the intersection of the two DataFrames' key column values.

When the two DataFrames are joined with how='outer':

    df8= pd.merge(left,right,on=['key'],how='outer')
    df8

| key | data1 | data2
---|---|---|---
0 | a | 0.0 | 0.0
1 | b | 1.0 | 1.0
2 | b | 2.0 | 1.0
3 | d | 3.0 | NaN
4 | c | NaN | 2.0

When merge() performs an outer join, the result covers the union of the two DataFrames' key values, and missing entries are filled with NaN.

When the two DataFrames are joined with how='left':

    df9 = pd.merge(left,right,on=['key'],how='left')
    df9

| key | data1 | data2
---|---|---|---
0 | a | 0 | 0.0
1 | b | 1 | 1.0
2 | b | 2 | 1.0
3 | d | 3 | NaN

When merge() performs a left join, the left DataFrame's join keys serve as the baseline for merging the columns of both tables; missing entries are filled with NaN.

So what happens when the two DataFrames are joined with how='right'?

    df10= pd.merge(left,right,on=['key'],how='right')
    df10

| key | data1 | data2
---|---|---|---
0 | a | 0.0 | 0
1 | b | 1.0 | 1
2 | b | 2.0 | 1
3 | c | NaN | 2

As the result shows, a right join mirrors the left join: the right DataFrame's keys are kept as the baseline, and missing entries are filled with NaN.

Typical application scenarios for the two merge styles:

  • Two tables hold the transaction records for September and October respectively; concat() can stack them along axis 0 into one table
  • One table holds order information (order ID, amount, customer ID, etc.) and another holds customer information (customer ID, name, phone number, etc.); merge() can join the two tables into one complete table on customer ID, as sketched below
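
A minimal sketch of the second scenario; the tables and values below (订单号, 金额, 客户ID, 姓名) are hypothetical, invented purely for illustration:

    import pandas as pd
    
    # hypothetical orders table: order ID, amount, customer ID
    orders = pd.DataFrame({'订单号': [1, 2, 3],
                           '金额': [100, 250, 80],
                           '客户ID': ['c1', 'c2', 'c1']})
    # hypothetical customer table: customer ID, name
    customers = pd.DataFrame({'客户ID': ['c1', 'c2'],
                              '姓名': ['张三', '李四']})
    # join on the shared 客户ID column; how='left' keeps every order
    pd.merge(orders, customers, on='客户ID', how='left')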

II. Data Filtering

In the previous sections we learned how to fetch a single record or several consecutive records, but in real work we often deal with tens of thousands of rows or more.
So how can we quickly filter out the rows that meet a given condition?

    import pandas as pd
    
    # load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # get a quick look at the data
    print(df.shape)
    df.head()
    (72756, 7)

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17

1. Extract the users with more than 100 followers (关注者)

    bools = df['关注者']>100
    bools
    0        False
    1        False
    2        False
    3        False
    4        False
             ...  
    72751    False
    72752    False
    72753    False
    72754    False
    72755    False
    Name: 关注者, Length: 72756, dtype: bool
    df1 = df[bools]
    print(df1.shape)
    df1
    (5501, 7)

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17
101 | 587598f89f11daf90617fc44 | 61 | 123 | 205 | 670 | 24 | 18
112 | 587598f99f11daf90617fc5a | 1 | 804 | 197 | 830 | 39 | 26
114 | 587598f99f11daf90617fc5e | 44 | 62 | 201 | 314 | 8 | 3
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69
... | ... | ... | ... | ... | ... | ... | ...
72694 | 5878399e9f11da09c8fe690e | 3 | 172 | 219 | 466 | 68 | 11
72700 | 5878399e9f11da09c8fe691a | 21 | 39 | 281 | 51 | 45 | 2
72713 | 5878399e9f11da09c8fe6934 | 49 | 635 | 248 | 1772 | 151 | 123
72726 | 5878399e9f11da09c8fe694e | 447 | 62 | 728 | 14 | 1 | 21
72741 | 5878399e9f11da09c8fe696c | 3546 | 390 | 142 | 25 | 4 | 126

5501 rows × 7 columns

Let's walk back through the logic of the code above:

  1. First, test whether each user's follower count is greater than 100; rows that pass are marked True and the rest False. bools records whether each row meets the filter condition; it is a Series whose values are of type bool.
    bools = df['关注者']>100
  2. Then filter df by the value of each row in bools: rows marked True are kept, the rest are dropped.
    df1 = df[bools]
  3. The df1 printed at the end is exactly the set of users with more than 100 followers. This is the basic logic of filtering pandas rows by a column's values; the two steps are often collapsed into one, as sketched below.
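
A minimal one-liner equivalent of the two steps above:

    # same result as df1 above, without the intermediate bools variable
    df1 = df[df['关注者'] > 100]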

That covers filtering on a single column's values. What if we need to filter on several conditions at once?

2. Extract the users with more than 300 followers (关注者) and more than 100 followees (关注)

    import pandas as pd
    # load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # get a quick look at the data
    df

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17
... | ... | ... | ... | ... | ... | ... | ...
72751 | 5878399e9f11da09c8fe6980 | 112 | 77 | 27 | 39 | 2 | 32
72752 | 5878399e9f11da09c8fe6982 | 15 | 90 | 16 | 61 | 116 | 2
72753 | 5878399e9f11da09c8fe6984 | 7 | 19 | 1 | 25 | 12 | 2
72754 | 5878399e9f11da09c8fe6986 | 14 | 320 | 3 | 68 | 38 | 9
72755 | 5878399e9f11da09c8fe6988 | 7 | 1 | 2 | 7 | 15 | 0

72756 rows × 7 columns

    bool_1 = df['关注者']>300
    bool_2 = df['关注']>100
    df2 = df[bool_1 & bool_2]
    df2

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69
155 | 587598f99f11daf90617fcb0 | 194 | 250 | 1103 | 10 | 1 | 19
228 | 587598f99f11daf90617fd42 | 48 | 224 | 1450 | 360 | 128 | 20
261 | 587598f99f11daf90617fd84 | 1125 | 842 | 24848 | 108 | 33 | 27
... | ... | ... | ... | ... | ... | ... | ...
72173 | 5878399b9f11da09c8fe64fc | 38 | 5000 | 1386 | 2205 | 25 | 6
72416 | 5878399d9f11da09c8fe66e2 | 661 | 299 | 457 | 32 | 4 | 9
72463 | 5878399d9f11da09c8fe6740 | 274 | 209 | 1556 | 2405 | 188 | 318
72648 | 5878399e9f11da09c8fe68b2 | 10 | 278 | 471 | 763 | 66 | 12
72661 | 5878399e9f11da09c8fe68cc | 756 | 491 | 414 | 229 | 2 | 58

1921 rows × 7 columns

Walking back through this code:

  1. The two conditions df['关注者']>300 and df['关注']>100 produce the two boolean Series bool_1 and bool_2
  2. Since both conditions must hold at the same time, the two Series are combined with the logical operator &
  3. Indexing df with the combined mask returns the rows that satisfy both conditions (see the sketch below)
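
The two conditions can also be written inline. Note that the parentheses are required because & binds more tightly than the comparison operators, and that element-wise logic must use & and |, not Python's and / or:

    # one-expression equivalent of df[bool_1 & bool_2]
    df2 = df[(df['关注者'] > 300) & (df['关注'] > 100)]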

3. A second way to filter data

The same filtering can be achieved with the loc[] accessor.

    import pandas as pd
    
    # load the data
    df = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    # get a quick look at the data
    print(df.shape)
    df.head()
    (72756, 7)

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
0 | 587598f89f11daf90617fb7a | 52 | 17 | 1 | 30 | 58 | 2
1 | 587598f89f11daf90617fb7c | 27 | 73 | 15 | 87 | 26 | 1
2 | 587598f89f11daf90617fb7e | 72 | 94 | 1 | 112 | 20 | 4
3 | 587598f89f11daf90617fb80 | 174 | 84 | 8 | 895 | 30 | 7
4 | 587598f89f11daf90617fb82 | 3 | 236 | 64 | 119 | 44 | 17

    # users with more than 100 followers (关注者)
    df_new = df.loc[df['关注者']>100]
    df_new

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
79 | 587598f89f11daf90617fc18 | 8 | 111 | 3049 | 1216 | 45 | 17
101 | 587598f89f11daf90617fc44 | 61 | 123 | 205 | 670 | 24 | 18
112 | 587598f99f11daf90617fc5a | 1 | 804 | 197 | 830 | 39 | 26
114 | 587598f99f11daf90617fc5e | 44 | 62 | 201 | 314 | 8 | 3
121 | 587598f99f11daf90617fc6c | 44 | 628 | 6597 | 1566 | 144 | 69
... | ... | ... | ... | ... | ... | ... | ...
72694 | 5878399e9f11da09c8fe690e | 3 | 172 | 219 | 466 | 68 | 11
72700 | 5878399e9f11da09c8fe691a | 21 | 39 | 281 | 51 | 45 | 2
72713 | 5878399e9f11da09c8fe6934 | 49 | 635 | 248 | 1772 | 151 | 123
72726 | 5878399e9f11da09c8fe694e | 447 | 62 | 728 | 14 | 1 | 21
72741 | 5878399e9f11da09c8fe696c | 3546 | 390 | 142 | 25 | 4 | 126

5501 rows × 7 columns

III. Data Sorting

Sorting is another operation we frequently need while working with data. For example: find the information of the 10 users with the most followers.

Data can be sorted with the sort_index() and sort_values() methods, and both are supported by Series as well as DataFrame.

So what is the difference between the two on a DataFrame?

  • sort_index() sorts the rows by their index labels
  • sort_values() sorts by the values of a specified column (see the toy sketch below)
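
A minimal toy sketch of the difference; df_demo is a hypothetical frame invented for illustration:

    import pandas as pd
    
    # hypothetical two-row frame with an unsorted string index
    df_demo = pd.DataFrame({'score': [90, 85]}, index=['b', 'a'])
    df_demo.sort_index()             # rows reordered by index label: a, b
    df_demo.sort_values(by='score')  # rows reordered by score: 85, 90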

1. Sort the data by country code (the index)

    import pandas as pd
    
    people = pd.read_excel(r'C:\Users\lin-a\Desktop\data\rate.xlsx',index_col='Country Code')
    people

| CountryName | 1990 | 2000 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Change 1990-2015 | Change 2007-2015
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Country Code | | | | | | | | | | | | | |
AFG | Afghanistan | 101.094930 | 103.254202 | 100.000371 | 100.215886 | 100.060480 | 99.459839 | 97.667911 | 95.312707 | 92.602785 | 89.773777 | 86.954464 | -14.140466 | -13.045907
ALB | Albania | 61.808311 | 59.585866 | 50.862987 | 49.663787 | 48.637067 | NaN | 46.720288 | 45.835739 | 45.247477 | 44.912168 | 44.806973 | -17.001338 | -6.056014
DZA | Algeria | 87.675705 | 62.886169 | 49.487870 | 48.910002 | 48.645026 | 48.681853 | 49.233576 | 49.847713 | 50.600697 | 51.536631 | 52.617579 | -35.058127 | 3.129709
ASM | American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
ADO | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
VIR | Virgin Islands (U.S.) | 53.546901 | 52.647183 | 49.912779 | 50.459425 | 51.257336 | 52.382523 | 53.953515 | 55.687666 | 57.537152 | 59.399244 | 61.199651 | 7.652751 | 11.286872
WBG | West Bank and Gaza | 102.789182 | 100.410470 | 88.367099 | 86.204936 | 84.173732 | 82.333762 | 80.851747 | 79.426778 | 78.118403 | 76.975462 | 76.001869 | -26.787313 | -12.365230
YEM | Yemen, Rep. | 118.779727 | 105.735754 | 88.438350 | 85.899643 | 83.594489 | 81.613924 | 80.193948 | 78.902603 | 77.734373 | 76.644268 | 75.595147 | -43.184581 | -12.843203
ZMB | Zambia | 100.485263 | 97.124776 | 98.223058 | 98.277253 | 98.260795 | 98.148325 | 97.854237 | 97.385802 | 96.791310 | 96.122165 | 95.402326 | -5.082936 | -2.820732
ZWE | Zimbabwe | 96.365244 | 84.877487 | 81.272166 | 81.024020 | 80.934968 | 80.985702 | 80.740494 | 80.579870 | 80.499816 | 80.456439 | 80.391033 | -15.974211 | -0.881133

219 rows × 14 columns

    people.sort_index(inplace=True,ascending=True)
    people.head()

| CountryName | 1990 | 2000 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Change 1990-2015 | Change 2007-2015
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Country Code | | | | | | | | | | | | | |
ABW | Aruba | 47.500831 | 44.420778 | 43.475981 | 43.957839 | 44.531061 | 45.157235 | 44.993173 | 44.706680 | 44.355302 | 44.070252 | 43.952390 | -3.548441 | 0.476409
ADO | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
AFG | Afghanistan | 101.094930 | 103.254202 | 100.000371 | 100.215886 | 100.060480 | 99.459839 | 97.667911 | 95.312707 | 92.602785 | 89.773777 | 86.954464 | -14.140466 | -13.045907
AGO | Angola | 101.394722 | 100.930475 | 102.563811 | 102.609186 | 102.428788 | 102.035690 | 102.106756 | 101.836900 | 101.315234 | 100.637667 | 99.855751 | -1.538971 | -2.708060
ALB | Albania | 61.808311 | 59.585866 | 50.862987 | 49.663787 | 48.637067 | NaN | 46.720288 | 45.835739 | 45.247477 | 44.912168 | 44.806973 | -17.001338 | -6.056014

Walking back through this code:

  1. The index_col='Country Code' argument to read_excel() makes the Country Code column the row index as the file is read
  2. The inplace=True parameter works as before: it controls whether the original data is modified in place
  3. ascending controls the sort order; the default True sorts ascending from small to large, while False sorts descending (see the sketch below)
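
For example, a minimal sketch of a descending sort that leaves people untouched (inplace defaults to False, so sort_index() returns a new DataFrame):

    # largest country codes first; the original frame is not modified
    people.sort_index(ascending=False).head()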

2. Get the data of the 10 users with the most followers

For this, we need to sort by the 关注者 (followers) column.

    import pandas as pd
    
    user = pd.read_csv(r'C:\Users\lin-a\Desktop\data\mouhu_users_2017.csv')
    user.sort_values(by='关注者',ascending=False,inplace=True)
    user.head(10)

| _id | 关注的收藏夹 | 关注 | 关注者 | 关注的问题 | 关注的话题 | 关注的专栏
---|---|---|---|---|---|---|---
23313 | 5876ee159f11da03c9726155 | 3181 | 1053 | 585787 | 93 | 33 | 26
14461 | 5876ede89f11da03c9721c33 | 1041 | 254 | 356246 | 92 | 3 | 23
22048 | 5876ee0c9f11da03c9725773 | 41 | 72 | 351414 | 336 | 47 | 16
52464 | 5876ece59f11da03afbc71b3 | 1652 | 794 | 308553 | 23 | 1 | 112
42135 | 5876eb5a9f11da0399d461b7 | 1182 | 532 | 268353 | 23 | 1 | 54
63579 | 5878396e9f11da09c8fe21e6 | 10819 | 1412 | 198199 | 90 | 1 | 517
21740 | 5876ee0b9f11da03c972550b | 991 | 479 | 195175 | 46 | 3 | 18
56721 | 5876ecfa9f11da03afbc92f1 | 164 | 550 | 164254 | 0 | 1 | 19
22784 | 5876ee129f11da03c9725d33 | 2011 | 579 | 148457 | 49 | 6 | 44
29577 | 587838f59f11da09bd427945 | 8 | 303 | 137087 | 477 | 19 | 15

Walking back through this code:

  1. by: selects the column to sort by; assign it the name of that column
  2. ascending=False: arranges the data from largest to smallest
  3. inplace=True: controls whether the original data is modified in place (an alternative is sketched below)
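
As an aside, when only the top N rows are needed, pandas also provides nlargest(), which avoids sorting the whole frame; a minimal sketch on the same user DataFrame:

    # top 10 rows by the 关注者 column; the original frame is untouched
    user.nlargest(10, '关注者')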

IV. Summary

Data merging (summary diagram omitted)

Data filtering (summary diagram omitted)

Data sorting (summary diagram omitted)

V. Exercises

  1. We have two data files on hand. Merge the two CSV files into one, sort by ride duration in descending order, and get the longest ride duration.
    import pandas as pd
    # read the first file
    bike1 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_1_data.csv')
    bike1

| Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type
---|---|---|---|---|---|---|---
0 | 221834 | 2017/1/1 0:00 | 2017/1/1 0:04 | 3rd & Tingey St SE | M St & New Jersey Ave SE | W00869 | Member
1 | 1676854 | 2017/1/1 0:06 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W00894 | Casual
2 | 1356956 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W21945 | Casual
3 | 1327901 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W20012 | Casual
4 | 1636768 | 2017/1/1 0:07 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W22786 | Casual
... | ... | ... | ... | ... | ... | ... | ...
42400 | 262484 | 2017/1/10 23:52 | 2017/1/10 23:56 | Georgia & New Hampshire Ave NW | 11th & Kenyon St NW | W01445 | Member
42401 | 630296 | 2017/1/10 23:52 | 2017/1/11 0:03 | 17th & K St NW / Farragut Square | 21st & M St NW | W00097 | Casual
42402 | 114584 | 2017/1/10 23:53 | 2017/1/10 23:55 | Montgomery & East Ln | Bethesda Ave & Arlington Rd | W20637 | Member
42403 | 564322 | 2017/1/10 23:53 | 2017/1/11 0:03 | 17th & K St NW / Farragut Square | 21st & M St NW | W01377 | Casual
42404 | 809980 | 2017/1/10 23:59 | 2017/1/11 0:13 | 4th St & Madison Dr NW | North Capitol St & F St NW | W22436 | Member

42405 rows × 7 columns

    # read the second file
    bike2 = pd.read_csv(r'C:\Users\lin-a\Desktop\data\2017_2_data.csv')
    bike2

| Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type
---|---|---|---|---|---|---|---
0 | 636921 | 2017/2/1 0:00 | 2017/2/1 0:11 | Park Rd & Holmead Pl NW | 14th & Upshur St NW | W01023 | Casual
1 | 399528 | 2017/2/1 0:01 | 2017/2/1 0:08 | M St & Pennsylvania Ave NW | 21st & M St NW | W20687 | Casual
2 | 186311 | 2017/2/1 0:05 | 2017/2/1 0:08 | Massachusetts Ave & Dupont Circle NW | 17th & Corcoran St NW | W21637 | Member
3 | 746740 | 2017/2/1 0:05 | 2017/2/1 0:17 | 11th & M St NW | 16th & Harvard St NW | W22297 | Member
4 | 508111 | 2017/2/1 0:05 | 2017/2/1 0:14 | 14th & R St NW | 18th St & Wyoming Ave NW | W22024 | Member
... | ... | ... | ... | ... | ... | ... | ...
72875 | 667119 | 2017/2/10 23:57 | 2017/2/11 0:08 | 24th & N St NW | 15th & L St NW | W21336 | Member
72876 | 569178 | 2017/2/10 23:58 | 2017/2/11 0:08 | 7th & T St NW | 18th & R St NW | W01405 | Member
72877 | 308804 | 2017/2/10 23:59 | 2017/2/11 0:04 | 18th & M St NW | New Hampshire Ave & T St NW | W22673 | Member
72878 | 1087725 | 2017/2/10 23:59 | 2017/2/11 0:17 | 7th & F St NW / National Portrait Gallery | 19th & East Capitol St SE | W00254 | Member
72879 | 572705 | 2017/2/10 23:59 | 2017/2/11 0:09 | 11th & H St NE | 4th & East Capitol St NE | W21512 | Member

72880 rows × 7 columns

    # merge the data
    # every parameter here is the default, so pd.concat([bike1,bike2]) works too
    bike3 = pd.concat([bike1,bike2],axis=0,join='outer')
    bike3

| Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type
---|---|---|---|---|---|---|---
0 | 221834 | 2017/1/1 0:00 | 2017/1/1 0:04 | 3rd & Tingey St SE | M St & New Jersey Ave SE | W00869 | Member
1 | 1676854 | 2017/1/1 0:06 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W00894 | Casual
2 | 1356956 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W21945 | Casual
3 | 1327901 | 2017/1/1 0:07 | 2017/1/1 0:29 | Henry Bacon Dr & Lincoln Memorial Circle NW | New York Ave & 15th St NW | W20012 | Casual
4 | 1636768 | 2017/1/1 0:07 | 2017/1/1 0:34 | Lincoln Memorial | 8th & D St NW | W22786 | Casual
... | ... | ... | ... | ... | ... | ... | ...
72875 | 667119 | 2017/2/10 23:57 | 2017/2/11 0:08 | 24th & N St NW | 15th & L St NW | W21336 | Member
72876 | 569178 | 2017/2/10 23:58 | 2017/2/11 0:08 | 7th & T St NW | 18th & R St NW | W01405 | Member
72877 | 308804 | 2017/2/10 23:59 | 2017/2/11 0:04 | 18th & M St NW | New Hampshire Ave & T St NW | W22673 | Member
72878 | 1087725 | 2017/2/10 23:59 | 2017/2/11 0:17 | 7th & F St NW / National Portrait Gallery | 19th & East Capitol St SE | W00254 | Member
72879 | 572705 | 2017/2/10 23:59 | 2017/2/11 0:09 | 11th & H St NE | 4th & East Capitol St NE | W21512 | Member

115285 rows × 7 columns

    # sort ride durations in descending order
    bike3.sort_values(by='Duration (ms)',ascending=False,inplace=True)
    bike3

| Duration (ms) | Start date | End date | Start station | End station | Bike number | Member type
---|---|---|---|---|---|---|---
24905 | 84876226 | 2017/1/6 10:59 | 2017/1/7 10:33 | 17th & K St NW / Farragut Square | 19th & K St NW | W23232 | Member
44077 | 84070936 | 2017/2/7 10:44 | 2017/2/8 10:05 | Thomas Circle | 4th & C St SW | W21615 | Casual
35857 | 83789379 | 2017/2/6 12:35 | 2017/2/7 11:51 | Fleet St & Ritchie Pkwy | E Montgomery Ave & Maryland Ave | W22565 | Casual
35874 | 83662979 | 2017/2/6 12:38 | 2017/2/7 11:53 | Fleet St & Ritchie Pkwy | E Montgomery Ave & Maryland Ave | W21846 | Casual
29204 | 82897521 | 2017/1/7 11:42 | 2017/1/8 10:44 | 8th & H St NW | 4th St & Madison Dr NW | W21095 | Casual
... | ... | ... | ... | ... | ... | ... | ...
22670 | 60614 | 2017/2/4 8:29 | 2017/2/4 8:30 | 18th & R St NW | 18th & R St NW | W22795 | Member
48831 | 60606 | 2017/2/7 18:31 | 2017/2/7 18:32 | Kennedy Center | Kennedy Center | W22223 | Casual
60148 | 60504 | 2017/2/8 19:14 | 2017/2/8 19:15 | Columbus Circle / Union Station | Columbus Circle / Union Station | W21323 | Casual
37585 | 60459 | 2017/2/6 16:53 | 2017/2/6 16:54 | Columbus Circle / Union Station | Columbus Circle / Union Station | W21710 | Member
48652 | 60452 | 2017/2/7 18:21 | 2017/2/7 18:22 | Towers Crescent Dr & Tysons One Pl | Towers Crescent Dr & Tysons One Pl | W23040 | Member

115285 rows × 7 columns

    # get the longest duration
    long_time = bike3.iloc[0]['Duration (ms)']
    long_time
    84876226
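
If only the maximum is needed, the sort can be skipped entirely; a minimal sketch on the merged frame:

    # same answer via a direct column max, no sorting required
    bike3['Duration (ms)'].max()
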
  2. Using the resident-population data from the sixth national census, find the 10 most populous cities in the country. The data contains four fields: 省 (province), 地区 (region), 结尾 (name suffix), and 常住人口 (resident population).
  • series.str treats each value as a string
  • contains() tests whether a string contains the given text and returns bool values
    import pandas as pd
    # read the data
    people = pd.read_csv(r'C:\Users\lin-a\Desktop\data\liupu.csv')
    
    # keep only rows whose 结尾 field contains '市' (city)
    bools = people['结尾'].str.contains('市')
    people_2 = people[bools]
    
    # sort by resident population (常住人口) in descending order
    people_3 = people_2.sort_values(by='常住人口',ascending=False,inplace=False)
    # take the top 10
    people_3.head(10)

| 省 | 地区 | 结尾 | 常住人口
---|---|---|---|---
567 | 上海市 | 上海市 | 市 | 23019196.0
23 | 北京市 | 北京市 | 市 | 19612368.0
569 | 四川省 | 成都市 | 市 | 14047625.0
601 | 天津市 | 天津市 | 市 | 12938693.0
73 | 广东省 | 广州市 | 市 | 12701948.0
157 | 河北省 | 保定市 | 市 | 11194382.0
232 | 黑龙江省 | 哈尔滨市 | 市 | 10635971.0
378 | 江苏省 | 苏州市 | 市 | 10459890.0
95 | 广东省 | 深圳市 | 市 | 10358381.0
204 | 河南省 | 南阳市 | 市 | 10263660.0

posted on 2021-07-02 13:46 by 夜的独白