Python Time Series Analysis

Time series and time series analysis

  In production and in scientific research, observing and measuring one variable (or a group of variables) at a sequence of time points yields a collection of discrete values ordered in time; such a collection is called a time series.
  Time series analysis is the body of theory and methods for building mathematical models from observed time series data via curve fitting and parameter estimation. It is widely used in macroeconomic control, market-potential forecasting, weather forecasting, crop pest and disaster prediction, and many other fields.

Generating time series with Pandas:

import pandas as pd
import numpy as np  

Time series representations (a quick sketch of each follows this list)

  • timestamp: a specific point in time
  • period: a fixed-frequency span, such as a month
  • interval: a duration between two points in time
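A minimal sketch of the three objects in pandas (illustrative values, not from the original):

import pandas as pd

pd.Timestamp('2016-07-01 10:15')   # timestamp: one point in time
pd.Period('2016-07', freq='M')     # period: a fixed span (here, the whole month)
pd.Timedelta('2 days 03:00:00')    # interval: a duration between two points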

date_range

  • specify a start time and a number of periods
  • H: hour
  • D: day
  • M: month
# several accepted ways to write the date: 2016 Jul 1; 7/1/2016; 1/7/2016; 2016-07-01; 2016/07/01
rng = pd.date_range('2016-07-01', periods = 10, freq = '3D')  # freq defaults to 'D' when omitted
rng

  Result:

DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
               '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
               '2016-07-25', '2016-07-28'],
              dtype='datetime64[ns]', freq='3D')
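The other aliases from the list above work the same way (a small illustration, not part of the original output):

pd.date_range('2016-07-01', periods=3, freq='H')  # hourly: 00:00, 01:00, 02:00
pd.date_range('2016-07-01', periods=3, freq='M')  # month ends: 2016-07-31, 2016-08-31, 2016-09-30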
import datetime as dt  # needed for dt.datetime below
time=pd.Series(np.random.randn(20),
           index=pd.date_range(dt.datetime(2016,1,1),periods=20))
print(time)
# Result:
2016-01-01   -0.129379
2016-01-02    0.164480
2016-01-03   -0.639117
2016-01-04   -0.427224
2016-01-05    2.055133
2016-01-06    1.116075
2016-01-07    0.357426
2016-01-08    0.274249
2016-01-09    0.834405
2016-01-10   -0.005444
2016-01-11   -0.134409
2016-01-12    0.249318
2016-01-13   -0.297842
2016-01-14   -0.128514
2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64

Filtering with truncate

time.truncate(before='2016-1-10')  # everything before Jan 10 is dropped

  Result:

2016-01-10   -0.005444
2016-01-11   -0.134409
2016-01-12    0.249318
2016-01-13   -0.297842
2016-01-14   -0.128514
2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64
time.truncate(after='2016-1-10')  # everything after Jan 10 is dropped
# Result:
2016-01-01   -0.129379
2016-01-02    0.164480
2016-01-03   -0.639117
2016-01-04   -0.427224
2016-01-05    2.055133
2016-01-06    1.116075
2016-01-07    0.357426
2016-01-08    0.274249
2016-01-09    0.834405
2016-01-10   -0.005444
Freq: D, dtype: float64

  

print(time['2016-01-15'])  # 0.063690487247
print(time['2016-01-15':'2016-01-20'])
Result:
2016-01-15    0.063690
2016-01-16   -2.246031
2016-01-17    0.359552
2016-01-18    0.383030
2016-01-19    0.402717
2016-01-20   -0.694068
Freq: D, dtype: float64


data=pd.date_range('2010-01-01','2011-01-01',freq='M')  # month-end frequency
print(data)
# Result:
DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
               '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
               '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
              dtype='datetime64[ns]', freq='M')

  

# Timestamps
pd.Timestamp('2016-07-10')  # Timestamp('2016-07-10 00:00:00')
# more detail can be specified
pd.Timestamp('2016-07-10 10')  # Timestamp('2016-07-10 10:00:00')
pd.Timestamp('2016-07-10 10:15')  # Timestamp('2016-07-10 10:15:00')

# How much detail can you add?
t = pd.Timestamp('2016-07-10 10:15')

# Periods (time spans)
pd.Period('2016-01')  # Period('2016-01', 'M')
pd.Period('2016-01-01')  # Period('2016-01-01', 'D')

# TIME OFFSETS
pd.Timedelta('1 day')#Timedelta('1 days 00:00:00')
pd.Period('2016-01-01 10:10') + pd.Timedelta('1 day')#Period('2016-01-02 10:10', 'T')
pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('1 day')#Timestamp('2016-01-02 10:10:00')
pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('15 ns')#Timestamp('2016-01-01 10:10:00.000000015')
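For calendar-aware arithmetic there is also pd.DateOffset, which respects month lengths where a fixed Timedelta cannot (an extra illustration, not in the original):

pd.Timestamp('2016-01-31') + pd.DateOffset(months=1)  # Timestamp('2016-02-29 00:00:00'), clipped to the month end
pd.Timestamp('2016-01-31') + pd.Timedelta('31 days')  # Timestamp('2016-03-02 00:00:00'), a fixed duration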

p1 = pd.period_range('2016-01-01 10:10', freq = '25H', periods = 10)
p2 = pd.period_range('2016-01-01 10:10', freq = '1D1H', periods = 10)
p1
p2
Result:
PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00',
             '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00',
             '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00',
             '2016-01-10 19:00'],
            dtype='period[25H]', freq='25H')
PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00',
             '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00',
             '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00',
             '2016-01-10 19:00'],
            dtype='period[25H]', freq='25H')

# using the range as an index
rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')
rng
pd.Series(range(len(rng)), index = rng)
Result:
2016-07-01    0
2016-07-02    1
2016-07-03    2
2016-07-04    3
2016-07-05    4
2016-07-06    5
2016-07-07    6
2016-07-08    7
2016-07-09    8
2016-07-10    9
Freq: D, dtype: int32

periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
ts = pd.Series(np.random.randn(len(periods)), index = periods)
ts
Result:
2016-01   -0.015837
2016-02   -0.923463
2016-03   -0.485212
Freq: M, dtype: float64

type(ts.index)#pandas.core.indexes.period.PeriodIndex

# timestamps and periods can be converted into one another
ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods = 10, freq = 'H'))
ts
Result:
2016-07-10 08:00:00    0
2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
2016-07-10 12:00:00    4
2016-07-10 13:00:00    5
2016-07-10 14:00:00    6
2016-07-10 15:00:00    7
2016-07-10 16:00:00    8
2016-07-10 17:00:00    9
Freq: H, dtype: int32

ts_period = ts.to_period()
ts_period
Result:
2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
2016-07-10 12:00    4
2016-07-10 13:00    5
2016-07-10 14:00    6
2016-07-10 15:00    7
2016-07-10 16:00    8
2016-07-10 17:00    9
Freq: H, dtype: int32

The difference between periods and timestamps

ts_period['2016-07-10 08:30':'2016-07-10 11:45']  # period slicing keeps 08:00, because that period covers 08:30
Result:
2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
Freq: H, dtype: int32

ts['2016-07-10 08:30':'2016-07-10 11:45']  # timestamp slicing drops 08:00, which lies before 08:30
# Result:
2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
Freq: H, dtype: int32

Resampling:

  • converts time series data from one frequency to another
  • downsampling: to a lower frequency
  • upsampling: to a higher frequency
import pandas as pd
import numpy as np
rng = pd.date_range('1/1/2011', periods=90, freq='D')  # daily data
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()
Result:
2011-01-01   -1.025562
2011-01-02    0.410895
2011-01-03    0.660311
2011-01-04    0.710293
2011-01-05    0.444985
Freq: D, dtype: float64

ts.resample('M').sum()  # downsample to monthly; the aggregate here is a sum, but mean etc. can be chosen instead
Result:
2011-01-31    2.510102
2011-02-28    0.583209
2011-03-31    2.749411
Freq: M, dtype: float64
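Several aggregates can be computed in one pass with .agg (a sketch, not from the original):

ts.resample('M').agg(['sum', 'mean', 'std'])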

ts.resample('3D').sum()  # downsample to 3-day bins
Result:
2011-01-01    0.045643
2011-01-04   -2.255206
2011-01-07    0.571142
2011-01-10    0.835032
2011-01-13   -0.396766
2011-01-16   -1.156253
2011-01-19   -1.286884
2011-01-22    2.883952
2011-01-25    1.566908
2011-01-28    1.435563
2011-01-31    0.311565
2011-02-03   -2.541235
2011-02-06    0.317075
2011-02-09    1.598877
2011-02-12   -1.950509
2011-02-15    2.928312
2011-02-18   -0.733715
2011-02-21    1.674817
2011-02-24   -2.078872
2011-02-27    2.172320
2011-03-02   -2.022104
2011-03-05   -0.070356
2011-03-08    1.276671
2011-03-11   -2.835132
2011-03-14   -1.384113
2011-03-17    1.517565
2011-03-20   -0.550406
2011-03-23    0.773430
2011-03-26    2.244319
2011-03-29    2.951082
Freq: 3D, dtype: float64

day3Ts = ts.resample('3D').mean()
day3Ts
Result:
2011-01-01    0.015214
2011-01-04   -0.751735
2011-01-07    0.190381
2011-01-10    0.278344
2011-01-13   -0.132255
2011-01-16   -0.385418
2011-01-19   -0.428961
2011-01-22    0.961317
2011-01-25    0.522303
2011-01-28    0.478521
2011-01-31    0.103855
2011-02-03   -0.847078
2011-02-06    0.105692
2011-02-09    0.532959
2011-02-12   -0.650170
2011-02-15    0.976104
2011-02-18   -0.244572
2011-02-21    0.558272
2011-02-24   -0.692957
2011-02-27    0.724107
2011-03-02   -0.674035
2011-03-05   -0.023452
2011-03-08    0.425557
2011-03-11   -0.945044
2011-03-14   -0.461371
2011-03-17    0.505855
2011-03-20   -0.183469
2011-03-23    0.257810
2011-03-26    0.748106
2011-03-29    0.983694
Freq: 3D, dtype: float64

print(day3Ts.resample('D').asfreq())  # upsampling leaves gaps that must be filled
Result:
2011-01-01    0.015214
2011-01-02         NaN
2011-01-03         NaN
2011-01-04   -0.751735
2011-01-05         NaN
2011-01-06         NaN
2011-01-07    0.190381
2011-01-08         NaN
2011-01-09         NaN
2011-01-10    0.278344
2011-01-11         NaN
2011-01-12         NaN
2011-01-13   -0.132255
2011-01-14         NaN
2011-01-15         NaN
2011-01-16   -0.385418
2011-01-17         NaN
2011-01-18         NaN
2011-01-19   -0.428961
2011-01-20         NaN
2011-01-21         NaN
2011-01-22    0.961317
2011-01-23         NaN
2011-01-24         NaN
2011-01-25    0.522303
2011-01-26         NaN
2011-01-27         NaN
2011-01-28    0.478521
2011-01-29         NaN
2011-01-30         NaN
                ...   
2011-02-28         NaN
2011-03-01         NaN
2011-03-02   -0.674035
2011-03-03         NaN
2011-03-04         NaN
2011-03-05   -0.023452
2011-03-06         NaN
2011-03-07         NaN
2011-03-08    0.425557
2011-03-09         NaN
2011-03-10         NaN
2011-03-11   -0.945044
2011-03-12         NaN
2011-03-13         NaN
2011-03-14   -0.461371
2011-03-15         NaN
2011-03-16         NaN
2011-03-17    0.505855
2011-03-18         NaN
2011-03-19         NaN
2011-03-20   -0.183469
2011-03-21         NaN
2011-03-22         NaN
2011-03-23    0.257810
2011-03-24         NaN
2011-03-25         NaN
2011-03-26    0.748106
2011-03-27         NaN
2011-03-28         NaN
2011-03-29    0.983694
Freq: D, Length: 88, dtype: float64

Filling methods:

  • ffill: fill a gap with the preceding value
  • bfill: fill a gap with the following value
  • interpolate: fill by linear interpolation
day3Ts.resample('D').ffill(1)  # forward-fill, at most 1 step
Result:
2011-01-01    0.015214
2011-01-02    0.015214
2011-01-03         NaN
2011-01-04   -0.751735
2011-01-05   -0.751735
2011-01-06         NaN
2011-01-07    0.190381
2011-01-08    0.190381
2011-01-09         NaN
2011-01-10    0.278344
2011-01-11    0.278344
2011-01-12         NaN
2011-01-13   -0.132255
2011-01-14   -0.132255
2011-01-15         NaN
2011-01-16   -0.385418
2011-01-17   -0.385418
2011-01-18         NaN
2011-01-19   -0.428961
2011-01-20   -0.428961
2011-01-21         NaN
2011-01-22    0.961317
2011-01-23    0.961317
2011-01-24         NaN
2011-01-25    0.522303
2011-01-26    0.522303
2011-01-27         NaN
2011-01-28    0.478521
2011-01-29    0.478521
2011-01-30         NaN
                ...   
2011-02-28    0.724107
2011-03-01         NaN
2011-03-02   -0.674035
2011-03-03   -0.674035
2011-03-04         NaN
2011-03-05   -0.023452
2011-03-06   -0.023452
2011-03-07         NaN
2011-03-08    0.425557
2011-03-09    0.425557
2011-03-10         NaN
2011-03-11   -0.945044
2011-03-12   -0.945044
2011-03-13         NaN
2011-03-14   -0.461371
2011-03-15   -0.461371
2011-03-16         NaN
2011-03-17    0.505855
2011-03-18    0.505855
2011-03-19         NaN
2011-03-20   -0.183469
2011-03-21   -0.183469
2011-03-22         NaN
2011-03-23    0.257810
2011-03-24    0.257810
2011-03-25         NaN
2011-03-26    0.748106
2011-03-27    0.748106
2011-03-28         NaN
2011-03-29    0.983694
Freq: D, Length: 88, dtype: float64

day3Ts.resample('D').bfill(1)  # backward-fill, at most 1 step
Result:
2011-01-01    0.015214
2011-01-02         NaN
2011-01-03   -0.751735
2011-01-04   -0.751735
2011-01-05         NaN
2011-01-06    0.190381
2011-01-07    0.190381
2011-01-08         NaN
2011-01-09    0.278344
2011-01-10    0.278344
2011-01-11         NaN
2011-01-12   -0.132255
2011-01-13   -0.132255
2011-01-14         NaN
2011-01-15   -0.385418
2011-01-16   -0.385418
2011-01-17         NaN
2011-01-18   -0.428961
2011-01-19   -0.428961
2011-01-20         NaN
2011-01-21    0.961317
2011-01-22    0.961317
2011-01-23         NaN
2011-01-24    0.522303
2011-01-25    0.522303
2011-01-26         NaN
2011-01-27    0.478521
2011-01-28    0.478521
2011-01-29         NaN
2011-01-30    0.103855
                ...   
2011-02-28         NaN
2011-03-01   -0.674035
2011-03-02   -0.674035
2011-03-03         NaN
2011-03-04   -0.023452
2011-03-05   -0.023452
2011-03-06         NaN
2011-03-07    0.425557
2011-03-08    0.425557
2011-03-09         NaN
2011-03-10   -0.945044
2011-03-11   -0.945044
2011-03-12         NaN
2011-03-13   -0.461371
2011-03-14   -0.461371
2011-03-15         NaN
2011-03-16    0.505855
2011-03-17    0.505855
2011-03-18         NaN
2011-03-19   -0.183469
2011-03-20   -0.183469
2011-03-21         NaN
2011-03-22    0.257810
2011-03-23    0.257810
2011-03-24         NaN
2011-03-25    0.748106
2011-03-26    0.748106
2011-03-27         NaN
2011-03-28    0.983694
2011-03-29    0.983694
Freq: D, Length: 88, dtype: float64

day3Ts.resample('D').interpolate('linear')  # fill by linear interpolation
Result:
2011-01-01    0.015214
2011-01-02   -0.240435
2011-01-03   -0.496085
2011-01-04   -0.751735
2011-01-05   -0.437697
2011-01-06   -0.123658
2011-01-07    0.190381
2011-01-08    0.219702
2011-01-09    0.249023
2011-01-10    0.278344
2011-01-11    0.141478
2011-01-12    0.004611
2011-01-13   -0.132255
2011-01-14   -0.216643
2011-01-15   -0.301030
2011-01-16   -0.385418
2011-01-17   -0.399932
2011-01-18   -0.414447
2011-01-19   -0.428961
2011-01-20    0.034465
2011-01-21    0.497891
2011-01-22    0.961317
2011-01-23    0.814979
2011-01-24    0.668641
2011-01-25    0.522303
2011-01-26    0.507709
2011-01-27    0.493115
2011-01-28    0.478521
2011-01-29    0.353632
2011-01-30    0.228744
                ...   
2011-02-28    0.258060
2011-03-01   -0.207988
2011-03-02   -0.674035
2011-03-03   -0.457174
2011-03-04   -0.240313
2011-03-05   -0.023452
2011-03-06    0.126218
2011-03-07    0.275887
2011-03-08    0.425557
2011-03-09   -0.031310
2011-03-10   -0.488177
2011-03-11   -0.945044
2011-03-12   -0.783820
2011-03-13   -0.622595
2011-03-14   -0.461371
2011-03-15   -0.138962
2011-03-16    0.183446
2011-03-17    0.505855
2011-03-18    0.276080
2011-03-19    0.046306
2011-03-20   -0.183469
2011-03-21   -0.036376
2011-03-22    0.110717
2011-03-23    0.257810
2011-03-24    0.421242
2011-03-25    0.584674
2011-03-26    0.748106
2011-03-27    0.826636
2011-03-28    0.905165
2011-03-29    0.983694
Freq: D, Length: 88, dtype: float64

Rolling windows in Pandas:

  A rolling window frames a fixed number of consecutive observations of the series and computes a statistic over the framed data. Think of a slider of fixed length moving along a ruler: each time it advances one unit, it reports a summary of whatever currently sits inside it.

  Rolling statistics smooth the data: the fluctuation range shrinks and each value becomes more representative, whereas any single raw observation may be an outlier, noisy, or simply wrong.

%matplotlib inline 
import matplotlib.pylab
import numpy as np
import pandas as pd
df = pd.Series(np.random.randn(600), index = pd.date_range('7/1/2016', freq = 'D', periods = 600))
df.head()
Result:
2016-07-01   -0.192140
2016-07-02    0.357953
2016-07-03   -0.201847
2016-07-04   -0.372230
2016-07-05    1.414753
Freq: D, dtype: float64

r = df.rolling(window = 10)
r#Rolling [window=10,center=False,axis=0]

#r.max, r.median, r.std, r.skew (skewness), r.sum, r.var
print(r.mean())
Result:
2016-07-01         NaN
2016-07-02         NaN
2016-07-03         NaN
2016-07-04         NaN
2016-07-05         NaN
2016-07-06         NaN
2016-07-07         NaN
2016-07-08         NaN
2016-07-09         NaN
2016-07-10    0.300133
2016-07-11    0.284780
2016-07-12    0.252831
2016-07-13    0.220699
2016-07-14    0.167137
2016-07-15    0.018593
2016-07-16   -0.061414
2016-07-17   -0.134593
2016-07-18   -0.153333
2016-07-19   -0.218928
2016-07-20   -0.169426
2016-07-21   -0.219747
2016-07-22   -0.181266
2016-07-23   -0.173674
2016-07-24   -0.130629
2016-07-25   -0.166730
2016-07-26   -0.233044
2016-07-27   -0.256642
2016-07-28   -0.280738
2016-07-29   -0.289893
2016-07-30   -0.379625
                ...   
2018-01-22   -0.211467
2018-01-23    0.034996
2018-01-24   -0.105910
2018-01-25   -0.145774
2018-01-26   -0.089320
2018-01-27   -0.164370
2018-01-28   -0.110892
2018-01-29   -0.205786
2018-01-30   -0.101162
2018-01-31   -0.034760
2018-02-01    0.229333
2018-02-02    0.043741
2018-02-03    0.052837
2018-02-04    0.057746
2018-02-05   -0.071401
2018-02-06   -0.011153
2018-02-07   -0.045737
2018-02-08   -0.021983
2018-02-09   -0.196715
2018-02-10   -0.063721
2018-02-11   -0.289452
2018-02-12   -0.050946
2018-02-13   -0.047014
2018-02-14    0.048754
2018-02-15    0.143949
2018-02-16    0.424823
2018-02-17    0.361878
2018-02-18    0.363235
2018-02-19    0.517436
2018-02-20    0.368020
Freq: D, Length: 600, dtype: float64
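The statistics listed in the comment above are all methods on the same Rolling object, and .agg computes several at once (a sketch reusing the r defined above):

r.agg(['mean', 'std', 'max'])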

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(15, 5))

df.plot(style='r--')
df.rolling(window=10).mean().plot(style='b')#<matplotlib.axes._subplots.AxesSubplot at 0x249627fb6d8>

  Result: (figure: the raw series with its 10-day rolling mean)

Stationarity and differencing:

  Basic model: the autoregressive moving-average model, ARMA(p, q), is one of the most important models in time series analysis. It consists of two parts: AR, an autoregressive process of order p, and MA, a moving-average process of order q.
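Written out (the standard form, supplied here for reference), an ARMA(p, q) process is

X_t = c + φ_1·X_{t-1} + … + φ_p·X_{t-p} + ε_t + θ_1·ε_{t-1} + … + θ_q·ε_{t-q}

where the φ_i are the autoregressive coefficients, the θ_j the moving-average coefficients, and ε_t is white noise.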

Stationarity tests

  Stationarity is a precondition for time series analysis, and many people wonder why it must hold. The law of large numbers and the central limit theorem require identically distributed samples (in the time series setting, identical distribution is equivalent to stationarity), and much of our modelling rests on those two theorems; when the assumption fails, many of the resulting conclusions are unreliable. Take spurious regression as an example: when the response and the inputs are both stationary, we use the t statistic to test the significance of the standardized coefficients. When they are not stationary, the standardized coefficients no longer follow a t distribution, so applying the t test inflates the probability of rejecting the null hypothesis, i.e. of committing a Type I error, and leads to wrong conclusions.

  Stationary time series are defined in two ways: strict stationarity and weak stationarity.

  Strict stationarity, as the name suggests, is an extremely demanding form of stationarity: it requires the statistical properties of the series to stay unchanged as time shifts. For any shift τ and any times t1, …, tn, the joint distribution must satisfy F(x_{t1}, …, x_{tn}) = F(x_{t1+τ}, …, x_{tn+τ}).

  Strict stationarity is mostly a theoretical construct; in practice the weak condition is what gets used.

Weak stationarity, also called second-order stationarity (stationary mean and variance), requires:

  • constant mean
  • constant variance
  • autocovariance that depends only on the lag, not on absolute time
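The text does not show a concrete test; one common choice (an assumption here, not from the original) is the augmented Dickey-Fuller test from statsmodels, whose null hypothesis is that the series has a unit root, i.e. is non-stationary:

from statsmodels.tsa.stattools import adfuller

def is_stationary(series, alpha=0.05):
    # reject the unit-root null when the p-value is below alpha
    p_value = adfuller(series.dropna())[1]
    return p_value < alpha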

 

  ARIMA models require a stationary series. So when you are handed a non-stationary time series, the first thing to do is difference it until a stationary series is obtained. If d rounds of differencing are needed, an ARIMA(p, d, q) model can be used, where d is the differencing order.

Second-order differencing means taking the first difference of the first difference.
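In pandas this is just diff applied twice (a sketch, assuming ts is any pandas Series):

diff1 = ts.diff(1)     # first difference: x_t - x_{t-1}
diff2 = diff1.diff(1)  # second difference: the first difference of diff1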

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
# http://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost -- prebuilt wheels for all kinds of Python packages; almost everything can be found here
import sys
import os

import pandas as pd
import numpy as np

# # Remote Data Access
# import pandas_datareader.data as web
# import datetime
# # reference: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html

# TSA from Statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt

# Display and Plotting
import matplotlib.pylab as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
np.set_printoptions(precision=5, suppress=True) # numpy

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# seaborn plotting style
sns.set(style='ticks', context='poster')
Result:
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

  

#Read the data
#US consumer sentiment index (UMCSENT)
Sentiment = 'data/sentiment.csv'
Sentiment = pd.read_csv(Sentiment, index_col=0, parse_dates=[0])

  

Sentiment.head()

  Result:

            UMCSENT
DATE
2000-01-01 112.00000
2000-02-01 111.30000
2000-03-01 107.10000
2000-04-01 109.20000
2000-05-01 110.70000
# Select the series from 2005 - 2016
sentiment_short = Sentiment.loc['2005':'2016']

  

sentiment_short.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Consumer Sentiment")
sns.despine()

  Result: (figure: the Consumer Sentiment plot)

sentiment_short['diff_1'] = sentiment_short['UMCSENT'].diff(1)  # first difference; the 1 is the lag in time steps and can be changed

sentiment_short['diff_2'] = sentiment_short['diff_1'].diff(1)  # difference again: the second difference

sentiment_short.plot(subplots=True, figsize=(18, 12))

  Result:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001D9383BACF8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001D939FAB6A0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000001D93A139B70>], dtype=object)

ARIMA models:

Order selection via the correlation functions:

Choose p and q from the ACF and PACF plots; as a rule of thumb, the lag at which the PACF cuts off suggests p, and the lag at which the ACF cuts off suggests q.

Building the ARIMA model:

del sentiment_short['diff_2']
del sentiment_short['diff_1']
sentiment_short.head()
print (type(sentiment_short))#<class 'pandas.core.frame.DataFrame'>

  

fig = plt.figure(figsize=(12,8))
#acf
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(sentiment_short, lags=20,ax=ax1)
ax1.xaxis.set_ticks_position('bottom')
fig.tight_layout();
#pacf
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(sentiment_short, lags=20, ax=ax2)
ax2.xaxis.set_ticks_position('bottom')
fig.tight_layout();
#the shaded band below is the confidence interval; the plots show how autocorrelation changes with the lag order, which guides the choice of p and q

  Result: (figure: ACF and PACF plots)

 

# lag scatter plots show the same information

lags=9

ncols=3
nrows=int(np.ceil(lags/ncols))

fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(4*ncols, 4*nrows))

for ax, lag in zip(axes.flat, np.arange(1,lags+1, 1)):
    lag_str = 't-{}'.format(lag)
    X = (pd.concat([sentiment_short, sentiment_short.shift(-lag)], axis=1,
                   keys=['y'] + [lag_str]).dropna())

    X.plot(ax=ax, kind='scatter', y='y', x=lag_str);
    corr = X.corr().iloc[0, 1]
    ax.set_ylabel('Original')
    ax.set_title('Lag: {} (corr={:.2f})'.format(lag_str, corr));
    ax.set_aspect('equal');
    sns.despine();

fig.tight_layout();

  Result: (figure: lag scatter plots)

# a more direct view
#a template: drop in your own data and use these four panels for evaluation and analysis
def tsplot(y, lags=None, title='', figsize=(14, 8)):
   
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax   = plt.subplot2grid(layout, (0, 0))
    hist_ax = plt.subplot2grid(layout, (0, 1))
    acf_ax  = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
    
    y.plot(ax=ts_ax)
    ts_ax.set_title(title)
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    plt.tight_layout()
    return ts_ax, acf_ax, pacf_ax

tsplot(sentiment_short, title='Consumer Sentiment', lags=36);

  Result: (figure: series, histogram, ACF, and PACF panels)

Parameter selection:

The BIC value depends on the sample, so BIC is a suitable criterion when the models being compared are fitted on the same sample.
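For reference, the standard definitions, with L the maximized likelihood, k the number of estimated parameters, and n the sample size (lower is better for both; BIC penalizes extra parameters more heavily as n grows):

AIC = 2k - 2·ln(L)
BIC = k·ln(n) - 2·ln(L)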

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function

import sys
import os

import pandas as pd
import numpy as np

# TSA from Statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt

# Display and Plotting
import matplotlib.pylab as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
np.set_printoptions(precision=5, suppress=True) # numpy

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# seaborn plotting style
sns.set(style='ticks', context='poster')

  

filename_ts = 'data/series1.csv'
ts_df = pd.read_csv(filename_ts, index_col=0, parse_dates=[0])

n_sample = ts_df.shape[0]

  

print(ts_df.shape)
print(ts_df.head())
Result:
(120, 1)
              value
2006-06-01  0.21507
2006-07-01  1.14225
2006-08-01  0.08077
2006-09-01 -0.73952
2006-10-01  0.53552

  

# Create a training sample and testing sample before analyzing the series

n_train=int(0.95*n_sample)+1
n_forecast=n_sample-n_train
#ts_df
ts_train = ts_df.iloc[:n_train]['value']
ts_test = ts_df.iloc[n_train:]['value']
print(ts_train.shape)
print(ts_test.shape)
print("Training Series:", "\n", ts_train.tail(), "\n")
print("Testing Series:", "\n", ts_test.head())

  Result:

(115,)
(5,)
Training Series: 
 2015-08-01    0.60371
2015-09-01   -1.27372
2015-10-01   -0.93284
2015-11-01    0.08552
2015-12-01    1.20534
Name: value, dtype: float64 

Testing Series: 
 2016-01-01    2.16411
2016-02-01    0.95226
2016-03-01    0.36485
2016-04-01   -2.26487
2016-05-01   -2.38168
Name: value, dtype: float64
def tsplot(y, lags=None, title='', figsize=(14, 8)):
    
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax   = plt.subplot2grid(layout, (0, 0))
    hist_ax = plt.subplot2grid(layout, (0, 1))
    acf_ax  = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
    
    y.plot(ax=ts_ax)
    ts_ax.set_title(title)
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    fig.tight_layout()
    return ts_ax, acf_ax, pacf_ax

  

tsplot(ts_train, title='A Given Training Series', lags=20);

  Result: (figure: series, histogram, ACF, and PACF panels)

#Model Estimation

# Fit the model
arima200 = sm.tsa.SARIMAX(ts_train, order=(2,0,0))  # order holds the three parameters p, d, q
model_results = arima200.fit()  # fit the model

  

import itertools
#when no single guess works, grid-search over many (p, d, q) combinations and keep the best
p_min = 0
d_min = 0
q_min = 0
p_max = 4
d_max = 0
q_max = 4

# Initialize a DataFrame to store the results
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min,p_max+1)],
                           columns=['MA{}'.format(i) for i in range(q_min,q_max+1)])

for p,d,q in itertools.product(range(p_min,p_max+1),
                               range(d_min,d_max+1),
                               range(q_min,q_max+1)):
    if p==0 and d==0 and q==0:
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue
    
    try:
        model = sm.tsa.SARIMAX(ts_train, order=(p, d, q),
                               #enforce_stationarity=False,
                               #enforce_invertibility=False,
                              )
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except Exception:
        continue
results_bic = results_bic[results_bic.columns].astype(float)

  

fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(results_bic,
                 mask=results_bic.isnull(),
                 ax=ax,
                 annot=True,
                 fmt='.2f',
                 );
ax.set_title('BIC');

  Result: (figure: BIC heatmap)

# Alternative model selection method, limited to only searching AR and MA parameters

train_results = sm.tsa.arma_order_select_ic(ts_train, ic=['aic', 'bic'], trend='nc', max_ar=4, max_ma=4)

print('AIC', train_results.aic_min_order)
print('BIC', train_results.bic_min_order)
Result: the two criteria pick different orders, which is awkward and calls for further screening:
AIC (4, 2)
BIC (1, 1)

#residual analysis: normality and Q-Q plot linearity
model_results.plot_diagnostics(figsize=(16, 12));  # from the statsmodels library

  Result: (figure: residual diagnostics)

Q-Q plot: the closer the points are to a straight line, the closer the residuals are to a normal distribution; the further from a line, the further from normality.

Basic steps of time series modelling:

  1. Obtain the time series data of the observed system;
  2. Plot the data and check whether it is a stationary time series; a non-stationary series must first be differenced d times to obtain a stationary one;
  3. With the stationary series from step 2, compute its autocorrelation function (ACF) and partial autocorrelation function (PACF), and analyze the two plots to find the best orders p and q;
  4. From the p, d, q obtained above, fit the ARIMA model, then run diagnostic checks on it (a compressed sketch follows this list).
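A compressed sketch of the four steps with statsmodels (the data here is a synthetic placeholder, and the orders p and q still have to be read off the plots by eye):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# step 1: placeholder data -- a random walk, so one differencing pass suffices
idx = pd.date_range('2016-01-01', periods=200, freq='D')
series = pd.Series(np.random.randn(200).cumsum(), index=idx)

d = 1
stationary = series.diff(d).dropna()   # step 2: difference until stationary
plot_acf(stationary, lags=20)          # step 3: read q off the ACF ...
plot_pacf(stationary, lags=20)         # ... and p off the PACF
p, q = 1, 1                            # placeholder orders read from the plots
results = sm.tsa.SARIMAX(series, order=(p, d, q)).fit()  # step 4: fit ...
results.plot_diagnostics(figsize=(12, 8))                # ... and check residuals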

Stock prediction (a regression task):

%matplotlib inline
import pandas as pd
import pandas_datareader#for fetching stock data, e.g. from Yahoo Finance
import datetime
import matplotlib.pylab as plt
import seaborn as sns
from matplotlib.pylab import style
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font with CJK glyphs, in case labels contain Chinese
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with that font

  

stockFile = 'data/T10yr.csv'
stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])#use the date column as the index; parse_dates converts it to proper datetimes
stock.head(10)

  Result:

             Open   High    Low  Close  Volume  Adj Close
Date
2000-01-03 6.498 6.603 6.498 6.548 0 6.548
2000-01-04 6.530 6.548 6.485 6.485 0 6.485
2000-01-05 6.521 6.599 6.508 6.599 0 6.599
2000-01-06 6.558 6.585 6.540 6.549 0 6.549
2000-01-07 6.545 6.595 6.504 6.504 0 6.504
2000-01-10 6.540 6.567 6.536 6.558 0 6.558
2000-01-11 6.600 6.664 6.595 6.664 0 6.664
2000-01-12 6.659 6.696 6.645 6.696 0 6.696
2000-01-13 6.664 6.705 6.618 6.618 0 6.618
2000-01-14 6.623 6.688 6.563 6.674 0 6.674
stock_week = stock['Close'].resample('W-MON').mean()#weekly means, weeks anchored on Monday
stock_train = stock_week['2000':'2015']

  

stock_train.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()

  Result: (figure: weekly closing prices)

stock_diff = stock_train.diff()
stock_diff = stock_diff.dropna()

plt.figure()
plt.plot(stock_diff)
plt.title('First-order difference')
plt.show()

  Result: (figure: the first-differenced series)

acf = plot_acf(stock_diff, lags=20)
plt.title("ACF")
acf.show()

  Result: (figure: ACF of the differenced series)

pacf = plot_pacf(stock_diff, lags=20)
plt.title("PACF")
pacf.show()

  Result: (figure: PACF of the differenced series)

model = ARIMA(stock_train, order=(1, 1, 1),freq='W-MON')#p=1, d=1, q=1 on the weekly series

  

result = model.fit()
#print(result.summary())#reports the fitted ARIMA model's statistics

  

pred = result.predict('20140609', '20160701',dynamic=True, typ='levels')#forecast between the given start and end dates; the start must lie inside the original data, the end need not
print (pred)
Result:
2014-06-09    2.463559
2014-06-16    2.455539
2014-06-23    2.449569
2014-06-30    2.444183
2014-07-07    2.438962
2014-07-14    2.433788
2014-07-21    2.428627
2014-07-28    2.423470
2014-08-04    2.418315
2014-08-11    2.413159
2014-08-18    2.408004
2014-08-25    2.402849
2014-09-01    2.397693
2014-09-08    2.392538
2014-09-15    2.387383
2014-09-22    2.382227
2014-09-29    2.377072
2014-10-06    2.371917
2014-10-13    2.366761
2014-10-20    2.361606
2014-10-27    2.356451
2014-11-03    2.351296
2014-11-10    2.346140
2014-11-17    2.340985
2014-11-24    2.335830
2014-12-01    2.330674
2014-12-08    2.325519
2014-12-15    2.320364
2014-12-22    2.315208
2014-12-29    2.310053
                ...   
2015-12-07    2.057443
2015-12-14    2.052288
2015-12-21    2.047132
2015-12-28    2.041977
2016-01-04    2.036822
2016-01-11    2.031666
2016-01-18    2.026511
2016-01-25    2.021356
2016-02-01    2.016200
2016-02-08    2.011045
2016-02-15    2.005890
2016-02-22    2.000735
2016-02-29    1.995579
2016-03-07    1.990424
2016-03-14    1.985269
2016-03-21    1.980113
2016-03-28    1.974958
2016-04-04    1.969803
2016-04-11    1.964647
2016-04-18    1.959492
2016-04-25    1.954337
2016-05-02    1.949181
2016-05-09    1.944026
2016-05-16    1.938871
2016-05-23    1.933716
2016-05-30    1.928560
2016-06-06    1.923405
2016-06-13    1.918250
2016-06-20    1.913094
2016-06-27    1.907939
Freq: W-MON, Length: 108, dtype: float64

  

plt.figure(figsize=(6, 6))
plt.xticks(rotation=45)
plt.plot(pred)
plt.plot(stock_train)#[<matplotlib.lines.Line2D at 0x28025665278>]

  Result: (figure: the forecast plotted against the training series)

Classification with the tsfresh library:

tsfresh is an open-source Python package for extracting features from time series; it can compute more than 64 kinds of features, making it a veritable Swiss army knife of time series feature extraction. Consult the official tsfresh documentation when you need it.

%matplotlib inline
import matplotlib.pylab as plt
import seaborn as sns
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures
from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


#http://tsfresh.readthedocs.io/en/latest/text/quick_start.html#official documentation

  

download_robot_execution_failures()
df, y = load_robot_execution_failures()
df.head()

  Result:

     id  time   a    b    c    d    e    f
0    1    0    -1    -1    63    -3    -1    0
1    1    1    0    0    62    -3    -1    0
2    1    2    -1    -1    61    -3    0    0
3    1    3    -1    -1    63    -2    -1    0
4    1    4    -1    -1    63    -3    -1    0
df[df.id == 3][['time', 'a', 'b', 'c', 'd', 'e', 'f']].plot(x='time', title='Success example (id 3)', figsize=(12, 6));
df[df.id == 20][['time', 'a', 'b', 'c', 'd', 'e', 'f']].plot(x='time', title='Failure example (id 20)', figsize=(12, 6));

  Result: (figures: the success and failure examples)

extraction_settings = ComprehensiveFCParameters()#use the full set of feature calculators

  

#column_id (str) – The name of the id column to group by
#column_sort (str) – The name of the sort column.
X = extract_features(df, 
                     column_id='id', column_sort='time',#group by id, order by time
                     default_fc_parameters=extraction_settings,
                     impute_function= impute)

  

X.head()#the extracted features

  Result:

a__mean_abs_change_quantiles__qh_1.0__ql_0.8    a__percentage_of_reoccurring_values_to_all_values    a__mean_abs_change_quantiles__qh_1.0__ql_0.2    a__mean_abs_change_quantiles__qh_1.0__ql_0.0    a__large_standard_deviation__r_0.45    a__absolute_sum_of_changes    a__mean_abs_change_quantiles__qh_1.0__ql_0.4    a__mean_second_derivate_central    a__autocorrelation__lag_4    a__binned_entropy__max_bins_10    ...    f__fft_coefficient__coeff_0    f__fft_coefficient__coeff_1    f__fft_coefficient__coeff_2    f__fft_coefficient__coeff_3    f__fft_coefficient__coeff_4    f__fft_coefficient__coeff_5    f__fft_coefficient__coeff_6    f__fft_coefficient__coeff_7    f__fft_coefficient__coeff_8    f__fft_coefficient__coeff_9
id                                                                                    
1    0.142857    0.933333    0.142857    0.142857    0.0    2.0    0.142857    -0.038462    0.17553    0.244930    ...    0.0    0.000000    0.000000    0.000000    0.000000    0.0    0.000000    0.000000    0.0    0.0
2    0.000000    1.000000    0.400000    1.000000    0.0    14.0    0.400000    -0.038462    0.17553    0.990835    ...    -4.0    0.744415    1.273659    -0.809017    1.373619    0.5    0.309017    -1.391693    0.0    0.0
3    0.000000    0.933333    0.714286    0.714286    0.0    10.0    0.714286    -0.038462    0.17553    0.729871    ...    -4.0    -0.424716    0.878188    1.000000    1.851767    0.5    1.000000    -2.805239    0.0    0.0
4    0.000000    1.000000    0.800000    1.214286    0.0    17.0    0.800000    -0.038462    0.17553    1.322950    ...    -5.0    -1.078108    3.678858    -3.618034    -1.466977    -0.5    -1.381966    -0.633773    0.0    0.0
5    2.000000    0.866667    0.916667    0.928571    0.0    13.0    0.916667    0.038462    0.17553    1.020037    ...    -2.0    -3.743460    3.049653    -0.618034    1.198375    -0.5    1.618034    -0.004568    0.0    0.0
5 rows × 1332 columns
X.info()
# Result:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 88 entries, 1 to 88
Columns: 1332 entries, a__mean_abs_change_quantiles__qh_1.0__ql_0.8 to f__fft_coefficient__coeff_9
dtypes: float64(1332)
memory usage: 916.4 KB

  

X_filtered = extract_relevant_features(df, y, 
                                       column_id='id', column_sort='time', 
                                       default_fc_parameters=extraction_settings)#filter the features, keeping only the most relevant; see the official docs for details

  

X_filtered.head()#the filtered features

  Result:

a__abs_energy    a__range_count__max_1__min_-1    b__abs_energy    e__variance    e__standard_deviation    e__abs_energy    c__standard_deviation    c__variance    a__standard_deviation    a__variance    ...    b__has_duplicate_max    b__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5    b__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_13__w_2    e__quantile__q_0.1    a__ar_coefficient__k_10__coeff_1    a__quantile__q_0.2    b__quantile__q_0.7    f__large_standard_deviation__r_0.35    f__quantile__q_0.9    d__spkt_welch_density__coeff_5
id                                                                                    
1    14.0    15.0    13.0    0.222222    0.471405    10.0    1.203698    1.448889    0.249444    0.062222    ...    1.0    -0.751682    -0.310265    -1.0    0.125000    -1.0    -1.0    0.0    0.0    0.037795
2    25.0    13.0    76.0    4.222222    2.054805    90.0    4.333846    18.782222    0.956847    0.915556    ...    1.0    0.057818    -0.202951    -3.6    -0.078829    -1.0    -1.0    1.0    0.0    0.319311
3    12.0    14.0    40.0    3.128889    1.768867    103.0    4.616877    21.315556    0.596285    0.355556    ...    0.0    0.912474    0.539121    -4.0    0.084836    -1.0    0.0    1.0    0.0    9.102780
4    16.0    10.0    60.0    7.128889    2.669998    124.0    3.833188    14.693333    0.952190    0.906667    ...    0.0    -0.609735    -2.641390    -4.6    0.003108    -1.0    1.0    0.0    0.0    56.910262
5    17.0    13.0    46.0    4.160000    2.039608    180.0    4.841487    23.440000    0.879394    0.773333    ...    0.0    0.072771    0.591927    -5.0    0.087906    -1.0    0.8    0.0    0.6    22.841805
5 rows × 300 columns
X_filtered.info()

  Result:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88 entries, 1 to 88
Columns: 300 entries, a__abs_energy to d__spkt_welch_density__coeff_5
dtypes: float64(300)
memory usage: 206.9 KB
X_train, X_test, X_filtered_train, X_filtered_test, y_train, y_test = train_test_split(X, X_filtered, y, test_size=.4)

  

cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))#evaluate the model; the numbers look quite good

  Result:

precision    recall  f1-score   support

          0       1.00      0.89      0.94         9
          1       0.96      1.00      0.98        27

avg / total       0.97      0.97      0.97        36
cl.n_features_#1332

  

cl2 = DecisionTreeClassifier()
cl2.fit(X_filtered_train, y_train)
print(classification_report(y_test, cl2.predict(X_filtered_test)))

  Result:

             precision    recall  f1-score   support

          0       1.00      0.78      0.88         9
          1       0.93      1.00      0.96        27

avg / total       0.95      0.94      0.94        36
cl2.n_features_#300

EDA of Wikipedia articles

The goal of exploratory data analysis (EDA) is to maximize our intuition about the data, and the only way to achieve that is to combine statistics with graphics presented in various forms. EDA makes it possible to:
1. Get an intuitive view of the data
2. Discover latent structure
3. Extract the important variables
4. Handle outliers
5. Test statistical hypotheses
6. Build preliminary models
7. Decide the optimal factor settings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline

  

train = pd.read_csv('train_1.csv').fillna(0)
train.head()

  Result:

Page    2015-07-01    2015-07-02    2015-07-03    2015-07-04    2015-07-05    2015-07-06    2015-07-07    2015-07-08    2015-07-09    ...    2016-12-22    2016-12-23    2016-12-24    2016-12-25    2016-12-26    2016-12-27    2016-12-28    2016-12-29    2016-12-30    2016-12-31
0    2NE1_zh.wikipedia.org_all-access_spider    18.0    11.0    5.0    13.0    14.0    9.0    9.0    22.0    26.0    ...    32.0    63.0    15.0    26.0    14.0    20.0    22.0    19.0    18.0    20.0
1    2PM_zh.wikipedia.org_all-access_spider    11.0    14.0    15.0    18.0    11.0    13.0    22.0    11.0    10.0    ...    17.0    42.0    28.0    15.0    9.0    30.0    52.0    45.0    26.0    20.0
2    3C_zh.wikipedia.org_all-access_spider    1.0    0.0    1.0    1.0    0.0    4.0    0.0    3.0    4.0    ...    3.0    1.0    1.0    7.0    4.0    4.0    6.0    3.0    4.0    17.0
3    4minute_zh.wikipedia.org_all-access_spider    35.0    13.0    10.0    94.0    4.0    26.0    14.0    9.0    11.0    ...    32.0    10.0    26.0    27.0    16.0    11.0    17.0    19.0    10.0    11.0
4    52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    ...    48.0    9.0    25.0    13.0    3.0    11.0    27.0    13.0    36.0    10.0
5 rows × 551 columns
train.info()
Result: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to 2016-12-31
dtypes: float64(550), object(1)
memory usage: 609.8+ MB

  

for col in train.columns[1:]:
    train[col] = pd.to_numeric(train[col],downcast='integer')#floats are memory-hungry; the table above shows every value ends in .0, so downcast to integers to save memory
train.head()

  Result:

Page    2015-07-01    2015-07-02    2015-07-03    2015-07-04    2015-07-05    2015-07-06    2015-07-07    2015-07-08    2015-07-09    ...    2016-12-22    2016-12-23    2016-12-24    2016-12-25    2016-12-26    2016-12-27    2016-12-28    2016-12-29    2016-12-30    2016-12-31
0    2NE1_zh.wikipedia.org_all-access_spider    18    11    5    13    14    9    9    22    26    ...    32    63    15    26    14    20    22    19    18    20
1    2PM_zh.wikipedia.org_all-access_spider    11    14    15    18    11    13    22    11    10    ...    17    42    28    15    9    30    52    45    26    20
2    3C_zh.wikipedia.org_all-access_spider    1    0    1    1    0    4    0    3    4    ...    3    1    1    7    4    4    6    3    4    17
3    4minute_zh.wikipedia.org_all-access_spider    35    13    10    94    4    26    14    9    11    ...    32    10    26    27    16    11    17    19    10    11
4    52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...    0    0    0    0    0    0    0    0    0    ...    48    9    25    13    3    11    27    13    36    10
5 rows × 551 columns
train.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to 2016-12-31
dtypes: int32(550), object(1)
memory usage: 305.5+ MB

  

def get_language(page):#classify each article by its wiki language code
    res = re.search(r'[a-z][a-z]\.wikipedia\.org',page)#dots escaped so they match literally
    #print (res.group()[0:2])
    if res:
        return res.group()[0:2]
    return 'na'

train['lang'] = train.Page.map(get_language)

from collections import Counter

print(Counter(train.lang))

  Result: Counter({'en': 24108, 'ja': 20431, 'de': 18547, 'na': 17855, 'fr': 17802, 'zh': 17229, 'ru': 15022, 'es': 14069})

 

lang_sets = {}
for lang in ['en', 'ja', 'de', 'na', 'fr', 'zh', 'ru', 'es']:
    lang_sets[lang] = train[train.lang == lang].iloc[:, 0:-1]

sums = {}
for key in lang_sets:
    sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0) / lang_sets[key].shape[0]#mean daily views per page

  

days = [r for r in range(sums['en'].shape[0])]

fig = plt.figure(1,figsize=[10,10])
plt.ylabel('Views per Page')
plt.xlabel('Day')
plt.title('Pages in Different Languages')
labels={'en':'English','ja':'Japanese','de':'German',
        'na':'Media','fr':'French','zh':'Chinese',
        'ru':'Russian','es':'Spanish'
       }

for key in sums:
    plt.plot(days,sums[key],label = labels[key] )
    
plt.legend()
plt.show()

  Result: (figure: average views per page by language)

def plot_entry(key,idx):
    data = lang_sets[key].iloc[idx,1:]
    fig = plt.figure(1,figsize=(10,5))
    plt.plot(days,data)
    plt.xlabel('day')
    plt.ylabel('views')
    plt.title(train.iloc[lang_sets[key].index[idx],0])
    
    plt.show()

  

idx = [1, 5, 10, 50, 100, 250,500, 750,1000,1500,2000,3000,4000,5000]
for i in idx:#plot a few individual English articles
    plot_entry('en',i)

  Result: (figures: one traffic plot per selected article)

npages = 5
top_pages = {}
for key in lang_sets:
    print(key)
    sum_set = pd.DataFrame(lang_sets[key][['Page']])
    sum_set['total'] = lang_sets[key].sum(axis=1)
    sum_set = sum_set.sort_values('total',ascending=False)
    print(sum_set.head(10))
    top_pages[key] = sum_set.index[0]#keep the row index of the most-viewed page, used for plotting below

  Result:

zh
                                                     Page      total
28727   Wikipedia:首页_zh.wikipedia.org_all-access_all-a...  123694312
61350    Wikipedia:首页_zh.wikipedia.org_desktop_all-agents   66435641
105844  Wikipedia:首页_zh.wikipedia.org_mobile-web_all-a...   50887429
28728   Special:搜索_zh.wikipedia.org_all-access_all-agents   48678124
61351      Special:搜索_zh.wikipedia.org_desktop_all-agents   48203843
28089   Running_Man_zh.wikipedia.org_all-access_all-ag...   11485845
30960   Special:链接搜索_zh.wikipedia.org_all-access_all-a...   10320403
63510    Special:链接搜索_zh.wikipedia.org_desktop_all-agents   10320336
60711     Running_Man_zh.wikipedia.org_desktop_all-agents    7968443
30446    瑯琊榜_(電視劇)_zh.wikipedia.org_all-access_all-agents    5891589



fr
                                                     Page      total
27330   Wikipédia:Accueil_principal_fr.wikipedia.org_a...  868480667
55104   Wikipédia:Accueil_principal_fr.wikipedia.org_m...  611302821
7344    Wikipédia:Accueil_principal_fr.wikipedia.org_d...  239589012
27825   Spécial:Recherche_fr.wikipedia.org_all-access_...   95666374
8221    Spécial:Recherche_fr.wikipedia.org_desktop_all...   88448938
26500   Sp?cial:Search_fr.wikipedia.org_all-access_all...   76194568
6978    Sp?cial:Search_fr.wikipedia.org_desktop_all-ag...   76185450
131296  Wikipédia:Accueil_principal_fr.wikipedia.org_a...   63860799
26993   Organisme_de_placement_collectif_en_valeurs_mo...   36647929
7213    Organisme_de_placement_collectif_en_valeurs_mo...   36624145



ru
                                                     Page       total
99322   Заглавная_страница_ru.wikipedia.org_all-access...  1086019452
103123  Заглавная_страница_ru.wikipedia.org_desktop_al...   742880016
17670   Заглавная_страница_ru.wikipedia.org_mobile-web...   327930433
99537   Служебная:Поиск_ru.wikipedia.org_all-access_al...   103764279
103349  Служебная:Поиск_ru.wikipedia.org_desktop_all-a...    98664171
100414  Служебная:Ссылки_сюда_ru.wikipedia.org_all-acc...    25102004
104195  Служебная:Ссылки_сюда_ru.wikipedia.org_desktop...    25058155
97670   Special:Search_ru.wikipedia.org_all-access_all...    24374572
101457  Special:Search_ru.wikipedia.org_desktop_all-ag...    21958472
98301   Служебная:Вход_ru.wikipedia.org_all-access_all...    12162587



ja
                                                     Page      total
120336      メインページ_ja.wikipedia.org_all-access_all-agents  210753795
86431          メインページ_ja.wikipedia.org_desktop_all-agents  134147415
123025       特別:検索_ja.wikipedia.org_all-access_all-agents   70316929
89202           特別:検索_ja.wikipedia.org_desktop_all-agents   69215206
57309       メインページ_ja.wikipedia.org_mobile-web_all-agents   66459122
119609    特別:最近の更新_ja.wikipedia.org_all-access_all-agents   17662791
88897        特別:最近の更新_ja.wikipedia.org_desktop_all-agents   17627621
119625        真田信繁_ja.wikipedia.org_all-access_all-agents   10793039
123292  特別:外部リンク検索_ja.wikipedia.org_all-access_all-agents   10331191
89463      特別:外部リンク検索_ja.wikipedia.org_desktop_all-agents   10327917



es
                                                     Page      total
92205   Wikipedia:Portada_es.wikipedia.org_all-access_...  751492304
95855   Wikipedia:Portada_es.wikipedia.org_mobile-web_...  565077372
90810   Especial:Buscar_es.wikipedia.org_all-access_al...  194491245
71199   Wikipedia:Portada_es.wikipedia.org_desktop_all...  165439354
69939   Especial:Buscar_es.wikipedia.org_desktop_all-a...  160431271
94389   Especial:Buscar_es.wikipedia.org_mobile-web_al...   34059966
90813   Especial:Entrar_es.wikipedia.org_all-access_al...   33983359
143440  Wikipedia:Portada_es.wikipedia.org_all-access_...   31615409
93094   Lali_Espósito_es.wikipedia.org_all-access_all-...   26602688
69942   Especial:Entrar_es.wikipedia.org_desktop_all-a...   25747141



en
                                                    Page        total
38573   Main_Page_en.wikipedia.org_all-access_all-agents  12066181102
9774       Main_Page_en.wikipedia.org_desktop_all-agents   8774497458
74114   Main_Page_en.wikipedia.org_mobile-web_all-agents   3153984882
39180  Special:Search_en.wikipedia.org_all-access_all...   1304079353
10403  Special:Search_en.wikipedia.org_desktop_all-ag...   1011847748
74690  Special:Search_en.wikipedia.org_mobile-web_all...    292162839
39172  Special:Book_en.wikipedia.org_all-access_all-a...    133993144
10399   Special:Book_en.wikipedia.org_desktop_all-agents    133285908
33644       Main_Page_en.wikipedia.org_all-access_spider    129020407
34257  Special:Search_en.wikipedia.org_all-access_spider    124310206



na
                                                    Page     total
45071  Special:Search_commons.wikimedia.org_all-acces...  67150638
81665  Special:Search_commons.wikimedia.org_desktop_a...  63349756
45056  Special:CreateAccount_commons.wikimedia.org_al...  53795386
45028  Main_Page_commons.wikimedia.org_all-access_all...  52732292
81644  Special:CreateAccount_commons.wikimedia.org_de...  48061029
81610  Main_Page_commons.wikimedia.org_desktop_all-ag...  39160923
46078  Special:RecentChangesLinked_commons.wikimedia....  28306336
45078  Special:UploadWizard_commons.wikimedia.org_all...  23733805
81671  Special:UploadWizard_commons.wikimedia.org_des...  22008544
82680  Special:RecentChangesLinked_commons.wikimedia....  21915202



de
                                                     Page       total
139119  Wikipedia:Hauptseite_de.wikipedia.org_all-acce...  1603934248
116196  Wikipedia:Hauptseite_de.wikipedia.org_mobile-w...  1112689084
67049   Wikipedia:Hauptseite_de.wikipedia.org_desktop_...   426992426
140151  Spezial:Suche_de.wikipedia.org_all-access_all-...   223425944
66736   Spezial:Suche_de.wikipedia.org_desktop_all-agents   219636761
140147  Spezial:Anmelden_de.wikipedia.org_all-access_a...    40291806
138800  Special:Search_de.wikipedia.org_all-access_all...    39881543
68104   Spezial:Anmelden_de.wikipedia.org_desktop_all-...    35355226
68511   Special:MyPage/toolserverhelferleinconfig.js_d...    32584955
137765  Hauptseite_de.wikipedia.org_all-access_all-agents    31732458
for key in top_pages:
    fig = plt.figure(1,figsize=(10,5))
    cols = train.columns
    cols = cols[1:-1]
    data = train.loc[top_pages[key],cols]
    plt.plot(days,data)
    plt.xlabel('Days')
    plt.ylabel('Views')
    plt.title(train.loc[top_pages[key],'Page'])
    plt.show()

  Result: (figures: one plot per language's most-viewed page)
