Python Study Notes (1): matplotlib, numpy, Pandas

Compiled from online sources and https://space.bilibili.com/243821484?from=search&seid=8124768530697300938

 

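2. numpy

  • 2.1 Mean

   Use the np.mean() function: numpy.mean(a, axis, dtype)

   Suppose a is data with dimensions [time, lat, lon]. Then:

    · axis not set: the mean is taken over all time × lat × lon values and a single number is returned
    · axis = 0: collapses the time dimension, averaging at each grid point and returning a [lat, lon] array (e.g. the N-year climatology of a field)
    · axis = (1, 2): collapses the latitude and longitude dimensions, averaging over space at each time step and returning a [time] array (e.g. a time series or an index)

   Pay special attention to missing values, which are common in meteorological data. In NCL the averaging functions skip them automatically, but in Python any arithmetic involving a missing value (np.nan) yields np.nan; for example, the mean of [1,2,3,4,np.nan] is np.nan.
   So when the data contain missing values, np.nanmean() is normally used instead. It is called the same way, and the mean of [1,2,3,4,np.nan] becomes (1+2+3+4)/4 = 2.5.
   Likewise, np.nanmax() and np.nanmin() make up for the same shortcoming of np.max() and np.min() when finding the maximum or minimum of an array.
   Many other numpy reduction functions have a NaN-aware counterpart with the 'nan' prefix (e.g. np.nansum(), np.nanstd()).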
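The axis rules and the NaN behaviour described above can be checked on a tiny array. This is only a sketch: the array `data` and its [time, lat, lon] = [2, 2, 3] shape are made up for illustration.

```python
import numpy as np

# Hypothetical data with shape [time, lat, lon] = [2, 2, 3]
data = np.arange(12, dtype=float).reshape(2, 2, 3)

mean_all = np.mean(data)                 # one number over all 12 values
mean_time = np.mean(data, axis=0)        # shape (2, 3): one value per [lat, lon] point
mean_space = np.mean(data, axis=(1, 2))  # shape (2,): one value per time step

# Missing values: np.mean propagates NaN, np.nanmean skips it
x = np.array([1, 2, 3, 4, np.nan])
plain = np.mean(x)       # nan
skipped = np.nanmean(x)  # 2.5

print(mean_all, mean_time.shape, mean_space, plain, skipped)
```

Note that averaging over two axes at once needs a tuple, axis=(1, 2).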

 

  • 2.2 Adding and removing dimensions

  Adding

  When working with arrays, you often need to align the dimensions of different arrays along some axis, which means adding a dimension to an array (especially when turning a 2-D array into a higher-dimensional tensor). numpy provides the expand_dims() function for this:

import numpy as np

a = np.array([[1,2],[3,4]])
print(a.shape)
print(a)
"""
(2, 2)
[[1 2]
 [3 4]]
"""
# To add a dimension, pass the axis at which the new dimension is inserted (axis indices start at 0)
a_add_dimension = np.expand_dims(a,axis=0)
print(a_add_dimension.shape)     # (1, 2, 2)

a_add_dimension2 = np.expand_dims(a,axis=-1)
print(a_add_dimension2.shape)    # (2, 2, 1)

a_add_dimension3 = np.expand_dims(a,axis=1)
print(a_add_dimension3.shape)    # (2, 1, 2)
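One common reason to align dimensions is to stack several 2-D arrays into one 3-D array. The sketch below assumes two made-up fields, `field_a` and `field_b`, each with shape [lat, lon]; it is an illustration, not part of the original notes.

```python
import numpy as np

# Two hypothetical 2-D fields with shape [lat, lon]
field_a = np.array([[1, 2], [3, 4]])
field_b = np.array([[5, 6], [7, 8]])

# Give each field a leading axis of length 1, then join along that axis
stacked = np.concatenate(
    [np.expand_dims(field_a, axis=0), np.expand_dims(field_b, axis=0)],
    axis=0,
)
print(stacked.shape)  # (2, 2, 2)
```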

 

  Removing

  Arrays often contain axes of length 1; the squeeze function removes these redundant dimensions.
 
  The reshape function
  
>>> e = np.arange(10)
>>> e
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> e.reshape(1,1,10)
array([[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]])
>>> e.reshape(1,10,1)
array([[[0],
        [1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7],
        [8],
        [9]]])

 

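  The squeeze function removes single-dimensional entries from the shape of an array, i.e. it drops the dimensions whose length in shape is 1.

  Usage: numpy.squeeze(a, axis=None)

  1. a is the input array;
  2. axis specifies which dimensions to remove; every specified dimension must have length 1, otherwise an error is raised;
  3. axis is optional and may be None, an int, or a tuple of ints. If axis is None, all length-1 dimensions are removed;
  4. Return value: an array;
  5. The original array is not modified.
>>> a = e.reshape(1,1,10)
>>> a
array([[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]])
>>> np.squeeze(a)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])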
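The points in the list above can be tried out directly. This is a small sketch using the same (1, 1, 10) array; the variable names are chosen only for illustration.

```python
import numpy as np

a = np.arange(10).reshape(1, 1, 10)

squeezed_all = np.squeeze(a)            # all length-1 axes removed: shape (10,)
squeezed_first = np.squeeze(a, axis=0)  # only axis 0 removed: shape (1, 10)

# Asking to squeeze an axis whose length is not 1 raises ValueError
try:
    np.squeeze(a, axis=2)
    raised = False
except ValueError:
    raised = True

# The original array keeps its shape
print(squeezed_all.shape, squeezed_first.shape, raised, a.shape)
```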

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

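3. Pandas

  Comparing with Python's lists and dictionaries: Numpy is like a list, with no labels on the values, whereas Pandas is like a dictionary.

import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])
print(s)
###################
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64
###################

  In the string representation of a Series, the index is on the left and the values are on the right. Since we did not specify an index for the data, an integer index from 0 to N-1 (N being the length of the data) is created automatically.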
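To make the dictionary comparison concrete, a Series can be given explicit labels and then read either by label or by position. The labels 'a', 'b', 'c' below are made up for the example.

```python
import pandas as pd

# Hypothetical values with explicit labels instead of the default 0..N-1 index
s = pd.Series([1.0, 3.0, 6.0], index=['a', 'b', 'c'])

by_label = s['b']          # looked up by label, like a dict key
by_position = s.iloc[1]    # looked up by position, like a list element

# Without an explicit index, pandas numbers the entries from 0
default_index = list(pd.Series([10, 20, 30]).index)

print(by_label, by_position, default_index)
```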

  • 3.1 DataFrame

dates = pd.date_range('20160101',periods=6)
print(dates)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d']) # index: row labels, columns: column labels
print(df)
#############################################################################
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
                   a         b         c         d
2016-01-01 -0.362729  0.025856 -0.453970  0.521317
2016-01-02 -0.694964 -0.418078 -0.034875 -0.382649
2016-01-03 -1.308891 -0.465486 -0.892237 -0.094203
2016-01-04  0.331540  0.621307  0.033407 -1.490113
2016-01-05 -1.770037  1.443139 -0.465179 -1.571931
2016-01-06  0.017418 -0.007310  1.151194 -0.043637
#############################################################################

  

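  A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and it can be viewed as a large dictionary of Series.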
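The "dictionary of Series" view can be seen in a few lines. This is a sketch with invented column names (`temp`, `flag`) and row labels.

```python
import pandas as pd

# Hypothetical columns of different types sharing one row index
df = pd.DataFrame(
    {'temp': [1.5, 2.5, 3.5], 'flag': [True, False, True]},
    index=['r0', 'r1', 'r2'],
)

col = df['temp']                          # indexing by column name returns a Series
is_series = isinstance(col, pd.Series)
same_index = col.index.equals(df.index)   # every column shares the row index
column_names = list(df.columns)           # the "keys" of the dictionary-like view

print(is_series, same_index, column_names, df.dtypes.to_dict())
```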

 

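  Select and display one column of the DataFrame

# The output below comes from a separate run, so the random values differ from the table above
print(df['b'])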
########################
2016-01-01    0.743081
2016-01-02   -0.558816
2016-01-03    0.287229
2016-01-04    1.850405
2016-01-05    0.619291
2016-01-06    0.847188
Freq: D, Name: b, dtype: float64
########################

 

   

  If no row or column labels are given, both default to integers starting from 0

df1 = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df1)
##########
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
########## 

  

 

  Display the row index, the column names, and all the values

 

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})

print(df2)
"""
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
"""
print(df2.index)
"""
Int64Index([0, 1, 2, 3], dtype='int64')
"""
# pandas 2.0 and later print Index([0, 1, 2, 3], dtype='int64') instead
print(df2.columns)
"""
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
"""
print(df2.values)
"""
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
"""

  

 

  Display the data type of each column

print(df2.dtypes)
"""
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
"""

 

 

  Summary statistics of the data (numeric columns only)

df2.describe()
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

 

  Transpose the data

print(df2.T)
                     0                    1                    2  \
A                    1                    1                    1   
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00   
C                    1                    1                    1   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                    1  
B  2013-01-02 00:00:00  
C                    1  
D                    3  
E                train  
F                  foo  

 

  Sort the data by its index and print

 

print(df2.sort_index(axis=0, ascending=True))    # axis=0 sorts by the row index, axis=1 sorts by the column labels
print(df2.sort_index(axis=1, ascending=False))   # ascending=True sorts in ascending order, False in descending order
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0

 

  

  Sort by the values of one column and print:

print(df2.sort_values(by='E'))
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
2  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
3  1.0 2013-01-02  1.0  3  train  foo
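Because every row of df2 is almost identical, the effect of sorting is hard to see there. The sketch below uses a small made-up frame with shuffled labels to show the three kinds of sort side by side.

```python
import pandas as pd

# Hypothetical frame with row labels and column labels both out of order
df = pd.DataFrame({'B': [3, 1, 2], 'A': [9, 8, 7]}, index=[2, 0, 1])

by_rows = df.sort_index(axis=0)                       # row labels become 0, 1, 2
by_cols = df.sort_index(axis=1)                       # columns become A, B
by_value = df.sort_values(by='B', ascending=False)    # column B becomes 3, 2, 1

print(by_rows, by_cols, by_value, sep='\n\n')
```

None of these calls changes df itself; each returns a new sorted DataFrame.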

 

  • 3.2 Selecting data

  

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
df
"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

 

  Select a single column

print(df['A'])
# or
print(df.A)
"""
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
"""

 

  Select across multiple rows or columns

print(df[0:3])
"""
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11
"""
print(df[0:3]["A"])
"""
2013-01-01    0
2013-01-02    4
2013-01-03    8
"""
print(df['20130102':'20130104'])
"""
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
"""

 

  loc selects data by label

print(df.loc['20130102'])
"""
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int64
"""
print(df.loc[:,['A','B']])
"""
             A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21
"""
print(df.loc['20130102',['A','B']])
"""
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int64
"""

 

   iloc selects data by integer position

print(df)
print(df.iloc[3,1])    # row 4, column 2 (positions are 0-based)
'''
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23


13

'''
print(df.iloc[3:5,1:3])    # rows 4 to 5 and columns 2 to 3 (positions 3-4 and 1-2; the end of a slice is excluded)
"""
             B   C
2013-01-04  13  14
2013-01-05  17  18
"""
print(df.iloc[[1,3,5],1:3])    # rows at positions 1, 3 and 5
"""
             B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22

"""

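One difference between the two indexers is easy to miss: a label slice with loc includes both end points, while a position slice with iloc excludes the end. A minimal sketch, rebuilding the same 6×4 frame used in this section:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

by_label = df.loc['20130102':'20130104', 'B':'C']   # both ends included
by_position = df.iloc[1:4, 1:3]                     # end positions 4 and 3 excluded

# Both selections cover the same three rows and two columns
same = by_label.equals(by_position)
print(by_label)
print(same)
```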
 

  Filtering with a boolean condition

print(df[df.A>8])
"""
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

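Several conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons. A sketch on the same 6×4 frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

both = df[(df.A > 8) & (df.D < 23)]     # A is 12 or 16 (the last row fails D < 23)
either = df[(df.A < 4) | (df.A > 16)]   # first and last rows

print(both)
print(either)
```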
 

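  Filtering by a condition on one column while keeping another: the example below selects, from the rows whose type column (column C of the spreadsheet) is PM2.5, the values in the station column 1006A (column I):

 

 

# raw string, so the backslashes in the Windows path are not treated as escape sequences
path = r'D:\python\站点_20190101-20191231\china_sites_20190101.csv'
csv_data = pd.read_csv(path)
aa = csv_data[csv_data['type'] == 'PM2.5'][['1006A']]   # only 1006A is selected, matching the output shown here
aa
######
     1006A
1     47.0
16    44.0
31    43.0
46    40.0
61    42.0
76    46.0
91    47.0
106   49.0
121   47.0
136   53.0
151   46.0
166   34.0
...
#######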
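The same filter can be tried without the CSV file. The frame below is a made-up stand-in: the `hour` column and all values are assumptions about the file layout, and only the column names `type` and `1006A` come from the example above. It also uses a single .loc call, which selects rows and columns in one step.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one row per (hour, type), one column per station
csv_data = pd.DataFrame({
    'hour': [0, 0, 1, 1],
    'type': ['AQI', 'PM2.5', 'AQI', 'PM2.5'],
    '1006A': [60.0, 47.0, 58.0, 44.0],
})

# Rows where type is PM2.5, keeping only the hour and the station column
pm25 = csv_data.loc[csv_data['type'] == 'PM2.5', ['hour', '1006A']]
print(pm25)
```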

 

 

  • 3.3 Editing and assigning values

# create the data
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
"""
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
"""

 

  Set values by position with iloc or by label with loc

df.iloc[2,2] = 1111
df.loc['20130101','B'] = 2222
"""
             A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     9  1111  11
2013-01-04  12    13    14  15
2013-01-05  16    17    18  19
2013-01-06  20    21    22  23
"""

 

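  Set values by condition

df.loc[df.A>4, 'B'] = 0   # where column A > 4, set column B to 0 (one .loc call instead of chained df.B[...] assignment)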
"""
             A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     0  1111  11
2013-01-04  12     0    14  15
2013-01-05  16     0    18  19
2013-01-06  20     0    22  23 
"""

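Conditional assignment is usually safest as a single .loc call that names the row mask and the column together. The chained form df.B[mask] = 0 can trigger SettingWithCopyWarning and, with copy-on-write enabled, may leave df unchanged. A sketch on a fresh copy of the 6×4 frame (without the earlier 1111 and 2222 edits):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

mask = df['A'] > 4                 # True for the last four rows
df.loc[mask, 'B'] = 0              # one column
df.loc[mask, ['C', 'D']] = -1      # several columns at once

print(df)
```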
 

  Set a whole row or column

df['F'] = np.nan
"""
             A     B     C   D   F
2013-01-01   0  2222     2   3 NaN
2013-01-02   4     5     6   7 NaN
2013-01-03   8     0  1111  11 NaN
2013-01-04  12     0    14  15 NaN
2013-01-05  16     0    18  19 NaN
2013-01-06  20     0    22  23 NaN
"""

 

  

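  Generating dates

from datetime import datetime

date_l = [datetime.strftime(x, '%Y%m%d') for x in pd.date_range(start="20190101", end="20190131")]
# The format string '%Y%m%d' sets the date format; it can be changed, e.g. to '%Y-%m-%d'
date_l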
['20190101',
 '20190102',
 '20190103',
 '20190104',
 '20190105',
 '20190106',
 '20190107',
 '20190108',
 '20190109',
 '20190110',
 '20190111',
 '20190112',
 '20190113',
 '20190114',
 '20190115',
 '20190116',
 '20190117',
 '20190118',
 '20190119',
 '20190120',
 '20190121',
 '20190122',
 '20190123',
 '20190124',
 '20190125',
 '20190126',
 '20190127',
 '20190128',
 '20190129',
 '20190130',
 '20190131']
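The same list can also be built without a Python loop, since a DatetimeIndex has its own strftime method. This is an alternative sketch, not the approach used in the notes above.

```python
import pandas as pd

dates = pd.date_range(start='20190101', end='20190131')

date_l = dates.strftime('%Y%m%d').tolist()        # '20190101', ..., '20190131'
date_dash = dates.strftime('%Y-%m-%d').tolist()   # '2019-01-01', ...

print(len(date_l), date_l[0], date_l[-1], date_dash[0])
```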

 

  Handling NaN

  1. Drop them directly, removing every column that contains NaN:

import pandas as pd

df = pd.DataFrame({'a':[None,1,2,3],'b':[4,None,None,6],'c':[1,2,1,2],'d':[7,7,9,2]})
print(df)

# count the NaN values in each column
print(df.isnull().sum())
     a    b  c  d
0  NaN  4.0  1  7
1  1.0  NaN  2  7
2  2.0  NaN  1  9
3  3.0  6.0  2  2
a    1
b    2
c    0
d    0
dtype: int64
data_without_NaN = df.dropna(axis=1)
print(data_without_NaN)
   c  d
0  1  7
1  2  7
2  1  9
3  2  2
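Dropping columns can throw away a lot of data. dropna can also drop rows instead, or look only at selected columns. A sketch on the same four-row frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3], 'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2], 'd': [7, 7, 9, 2]})

rows_kept = df.dropna(axis=0)        # drop every row that contains a NaN
only_a = df.dropna(subset=['a'])     # drop rows only where column 'a' is NaN

print(rows_kept)
print(only_a)
```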

  2. Imputing missing values

  Fill with the column mean

# sklearn.preprocessing.Imputer has been removed from recent scikit-learn versions; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()   # the default strategy is 'mean'
data_imputed = my_imputer.fit_transform(df)
print(type(data_imputed))
# convert the array back to a DataFrame
df_data_imputed = pd.DataFrame(data_imputed, columns=df.columns)
print(df_data_imputed)
     a    b    c    d
0  2.0  4.0  1.0  7.0
1  1.0  5.0  2.0  7.0
2  2.0  5.0  1.0  9.0
3  3.0  6.0  2.0  2.0
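For a plain mean fill, pandas alone is enough: fillna accepts the Series of column means returned by df.mean(). This is an alternative sketch to the scikit-learn approach above; unlike the imputer, it leaves the columns that had no NaN with their original integer type.

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2, 3], 'b': [4, None, None, 6],
                   'c': [1, 2, 1, 2], 'd': [7, 7, 9, 2]})

# Each NaN is replaced by the mean of its own column
filled = df.fillna(df.mean())
print(filled)
```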

 

posted @ 2020-11-22 22:26  laiber