# Practical Data Analysis (Tomasz Drabas), Reading Notes, Chapter 7: Time-Series Techniques (ARMA and ARIMA Models)

Python data analysis, personal study reading notes: table of contents

· Handling date objects in Python
· Understanding time-series data
· Smoothing and transforming observations
· Filtering time-series data
· Removing trend and seasonality
· Forecasting the future with ARMA and ARIMA models

7.1 Introduction


7.2 Handling Date Objects in Python

```python
import numpy as np
import pandas as pd
import pandas.tseries.offsets as ofst
import matplotlib
import matplotlib.pyplot as plt

# change the font size
matplotlib.rc('xtick', labelsize=9)
matplotlib.rc('ytick', labelsize=9)
matplotlib.rc('font', size=14)

# files we'll be working with
files = ['american.csv', 'columbia.csv']

# folder with data
data_folder = '../../Data/Chapter07/'

# colors
colors = ['#FF6600', '#000000', '#29407C', '#660000']

# read the data in (the dates in the first column become the index)
american = pd.read_csv(data_folder + files[0],
    index_col=0, parse_dates=[0])

columbia = pd.read_csv(data_folder + files[1],
    index_col=0, parse_dates=[0])

# combine the datasets
riverFlows = american.combine_first(columbia)

# periods aren't equal in the two datasets so find the overlap
# find the first month where the flow is missing for american
idx_american = riverFlows \
    .index[riverFlows['american_flow'].apply(np.isnan)].min()

# find the last month where the flow is missing for columbia
idx_columbia = riverFlows \
    .index[riverFlows['columbia_flow'].apply(np.isnan)].max()

# truncate the time series
riverFlows = riverFlows.truncate(
    before=idx_columbia + ofst.DateOffset(months=1),
    after=idx_american - ofst.DateOffset(months=1))
```

Tips:

/*
Traceback (most recent call last):
File "D:\Java2018\practicalDataAnalysis\Codes\Chapter07\ts_handlingData.py", line 49, in <module>
o.write(riverFlows.to_csv(ignore_index=True))
TypeError: to_csv() got an unexpected keyword argument 'ignore_index'

D:\Java2018\practicalDataAnalysis\Codes\Chapter07\ts_handlingData.py:80: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).mean()
year = riverFlows.resample('A', how='mean')

*/

/*
# year = riverFlows.resample('A', how='mean')
year = riverFlows.resample('A').mean()

# o.write(riverFlows.to_csv(ignore_index=True))
o.write(riverFlows.to_csv(index=True))
*/

The .truncate(...) method removes rows from a DataFrame based on its DatetimeIndex. A DatetimeIndex is an immutable array of numbers: internally the values are stored as large integers, but they present themselves as datetime objects, i.e., objects with both a date part and a time part.

The idx_... objects hold the smallest and largest dates for which at least one column has no data. However, if we passed these dates to .truncate(...) as they are, we would keep those very rows with missing values. To handle this, we shift each boundary with the .DateOffset(...) method; we move it by just one month.
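A minimal sketch of how `.truncate(...)` and `DateOffset` interact, using toy dates and values rather than the river data:

```python
import pandas as pd
import pandas.tseries.offsets as ofst

# a toy monthly series; the first month stands in for the
# boundary month found by the idx_... logic above
idx = pd.to_datetime(['2000-01-31', '2000-02-29',
                      '2000-03-31', '2000-04-30'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# shift the boundary one month forward, then cut everything before it;
# DateOffset(months=1) clamps to the month end (Jan 31 -> Feb 29)
first_bad = idx[0]
trimmed = s.truncate(before=first_bad + ofst.DateOffset(months=1))
# trimmed now starts at 2000-02-29: the "bad" month itself is dropped
```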

/*
Index of riverFlows
DatetimeIndex(['1933-01-31', '1933-02-28', '1933-03-31', '1933-04-30',
'1933-05-31', '1933-06-30', '1933-07-31', '1933-08-31',
'1933-09-30', '1933-10-31',
...
'1960-03-31', '1960-04-30', '1960-05-31', '1960-06-30',
'1960-07-31', '1960-08-31', '1960-09-30', '1960-10-31',
'1960-11-30', '1960-12-31'],
dtype='datetime64[ns]', name='', length=336, freq=None)

american_flow  columbia_flow

1933-01-31        10.7887          24.10
1933-02-28        14.6115          20.81
1933-03-31        19.6236          22.96
1933-04-30        21.9739          37.66
1933-05-31        28.0054         118.93
1933-06-30        66.0632         331.31
1933-07-31       113.4373         399.27
1933-08-31       162.0007         250.89
1933-09-30       156.6771         116.10
1933-10-31        17.9246          69.38
1933-11-30         7.0792          52.95
1933-12-31         4.0493          40.21
1934-01-31        11.5816          35.40
1934-02-28        18.5192          28.88
1934-03-31        53.8586          33.41
1934-04-30        75.8608         102.22
1934-05-31        89.3963         259.67
1934-06-30       116.2973         390.77

Shifting one month forward
american_flow  columbia_flow

1933-02-28        10.7887          24.10
1933-03-31        14.6115          20.81
1933-04-30        19.6236          22.96
1933-05-31        21.9739          37.66
1933-06-30        28.0054         118.93
1933-07-31        66.0632         331.31

Shifting one year forward
american_flow  columbia_flow

1934-01-31        10.7887          24.10
1934-02-28        14.6115          20.81
1934-03-31        19.6236          22.96
1934-04-30        21.9739          37.66
1934-05-31        28.0054         118.93
1934-06-30        66.0632         331.31

Averaging by quarter
american_flow  columbia_flow

1933-03-31      15.007933      22.623333
1933-06-30      38.680833     162.633333

Averaging by half a year
american_flow  columbia_flow

1933-01-31      10.788700      24.100000
1933-07-31      43.952483     155.156667

Averaging by year
american_flow  columbia_flow

1933-12-31      51.852875     123.714167
1934-12-31      44.334742     128.226667 */

```python
print(riverFlows['1933':'1934-06'])
```

```python
# shifting the data
by_month = riverFlows.shift(1, freq='M')
by_year = riverFlows.shift(12, freq='M')
```

```python
# averaging by quarter
quarter = riverFlows.resample('Q').mean()
# averaging by half a year
half = riverFlows.resample('6M').mean()
# averaging by year
year = riverFlows.resample('A').mean()
```

'A' anchors the bins on year-ends, and '6M' on half-year ends, i.e., December and June.
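What those anchors produce can be seen with made-up monthly values (a sketch, not the river data):

```python
import pandas as pd

# six months of toy data, one value per month-end
idx = pd.to_datetime(['2000-01-31', '2000-02-29', '2000-03-31',
                      '2000-04-30', '2000-05-31', '2000-06-30'])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# quarterly averages are labelled with the quarter-end dates
quarter = s.resample('Q').mean()
# Q1 averages Jan-Mar -> 2.0, labelled 2000-03-31;
# Q2 averages Apr-Jun -> 5.0, labelled 2000-06-30
```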

```python
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('font', size=14)

# colors
colors = ['#FF6600', '#000000', '#29407C', '#660000']

# plot time series
# monthly time series
riverFlows.plot(title='Monthly river flows', color=colors)
plt.savefig(data_folder + '/charts/monthly_riverFlows.png',
    dpi=300)
plt.close()
```

Matplotlib's configuration options are documented at http://matplotlib.org/users/customizing.html

The .plot(...) method draws the chart. We set the title and let the method pick colors from the color list. Finally, we save the figure to a file and close it to free the memory.

7.3 Understanding Time-Series Data

```python
# time series tools
import statsmodels.api as sm

riverFlows = pd.read_csv(data_folder + 'combined_flow.csv',
    index_col=0, parse_dates=[0])

# autocorrelation function
acf = {}    # to store the results

for col in riverFlows.columns:
    acf[col] = sm.tsa.stattools.acf(riverFlows[col])

# partial autocorrelation function
pacf = {}

for col in riverFlows.columns:
    pacf[col] = sm.tsa.stattools.pacf(riverFlows[col])

# periodogram (spectral density)
sd = {}

for col in riverFlows.columns:
    sd[col] = sm.tsa.stattools.periodogram(riverFlows[col])
```

```python
# plot the data
fig, ax = plt.subplots(2, 3)  # 2 rows and 3 columns

# set the size of the figure explicitly
fig.set_size_inches(12, 7)

# plot the charts for American
ax[0, 0].plot(acf['american_flow'], color=colors[0])
ax[0, 1].plot(pacf['american_flow'], color=colors[1])
ax[0, 2].plot(sd['american_flow'], color=colors[2])
ax[0, 2].yaxis.tick_right()  # shows the numbers on the right

# plot the charts for Columbia
ax[1, 0].plot(acf['columbia_flow'], color=colors[0])
ax[1, 1].plot(pacf['columbia_flow'], color=colors[1])
ax[1, 2].plot(sd['columbia_flow'], color=colors[2])
ax[1, 2].yaxis.tick_right()

# set titles for columns
ax[0, 0].set_title('ACF')
ax[0, 1].set_title('PACF')
ax[0, 2].set_title('Spectral density')

# set titles for rows
ax[0, 0].set_ylabel('American')
ax[1, 0].set_ylabel('Columbia')
```

Tips:

A stationary process is one whose mean, variance, and joint probability distribution do not change over time (http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc442.htm).

The PACF shows that the observation at time t depends strongly on the observations at t-1 and t-2 (and much less on earlier snapshots). The spectral density suggests a fundamental period of roughly 29, i.e., a basic pattern that repeats every 29 months. This is surprising: we would expect the series to repeat at multiples of 12. Several factors could explain it: incorrectly recorded observations, faulty equipment, or a shift in the underlying pattern during some of the observed years.
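The periodogram reading above can be sanity-checked on synthetic data. This sketch uses numpy's FFT rather than statsmodels' `periodogram`, but the idea is the same: the dominant spectral peak recovers the cycle length.

```python
import numpy as np

# a synthetic monthly series with a known 12-month cycle
n = 240
t = np.arange(n)
x = np.sin(2 * np.pi * t / 12)

# a basic periodogram via the FFT: power at each frequency
power = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)   # cycles per month

# the dominant frequency (skipping the zero-frequency term)
peak = freqs[np.argmax(power[1:]) + 1]
period = 1 / peak                   # months per cycle -> 12
```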

```python
# plot the data
fig, ax = plt.subplots(2, 2, sharex=True)

# set the size of the figure explicitly
fig.set_size_inches(8, 7)

# plot the charts for american
sm.graphics.tsa.plot_acf(
    riverFlows['american_flow'].squeeze(),
    lags=40, ax=ax[0, 0])

sm.graphics.tsa.plot_pacf(
    riverFlows['american_flow'].squeeze(),
    lags=40, ax=ax[0, 1])

# plot the charts for columbia
sm.graphics.tsa.plot_acf(
    riverFlows['columbia_flow'].squeeze(),
    lags=40, ax=ax[1, 0])

sm.graphics.tsa.plot_pacf(
    riverFlows['columbia_flow'].squeeze(),
    lags=40, ax=ax[1, 1])

# set titles for rows
ax[0, 0].set_ylabel('American')
ax[1, 0].set_ylabel('Columbia')
```

Both the .plot_acf(...) and .plot_pacf(...) methods take the data to plot as their first argument. The .squeeze() method converts a single-column DataFrame into a Series. Here we also explicitly specify the number of lags; the ax parameter determines where each chart is placed.

7.4 Smoothing and Transforming Observations

```python
# 12-month moving average (pd.rolling_mean in the book's older pandas)
ma_transform12  = riverFlows.rolling(window=12).mean()
# exponentially weighted moving average (pd.ewma in older pandas)
ma_transformExp = riverFlows.ewm(span=3).mean()
log_transform   = riverFlows.apply(np.log)
```

The exponentially weighted moving average (`.ewm(...).mean()` in current pandas; the book's `pd.ewma(...)` has long been removed) smooths the data with an exponentially decaying weight function. The span parameter controls the window of influence: how many past observations still contribute materially to the weighted average.
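The span maps onto the decay rate through alpha = 2 / (span + 1), so span=3 is the same smoother as alpha=0.5; a toy check (made-up values):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0])   # toy values

# span and alpha are two parameterizations of the same decay:
# alpha = 2 / (span + 1), so span=3 corresponds to alpha=0.5
by_span = s.ewm(span=3).mean()
by_alpha = s.ewm(alpha=0.5).mean()
```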

The moving-average smoothing removes much of the noise and reveals local trends in the data. For the American River, the level rises from the start until around 1942, then falls until about 1950, and afterwards alternates; for the Columbia River, you can see an initial downward trend that only reverses around 1945.

```python
import pandas as pd
import numpy as np

def holt_transform(column, alpha):
    '''
        Method to apply Holt transform

        The transform is given as
        y(t) = alpha * x(t) + (1-alpha) y(t-1)
    '''
    # create an np.array from the column
    original = np.array(column)

    # starting point for the transformation
    transformed = [original[0]]

    # apply the transform to the rest of the data
    for i in range(1, len(original)):
        transformed.append(
            original[i] * alpha +
            (1-alpha) * transformed[-1])

    return transformed
```

The smoothed value at time t is determined by the observation at t and the previous smoothed value; the influence of each observation is controlled by the alpha parameter.
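This recursion is exactly pandas' exponentially weighted mean with `adjust=False`; a small equivalence check on toy values:

```python
import numpy as np
import pandas as pd

def holt_transform(column, alpha):
    '''The recursion y(t) = alpha * x(t) + (1 - alpha) * y(t-1).'''
    original = np.array(column)
    transformed = [original[0]]
    for i in range(1, len(original)):
        transformed.append(
            original[i] * alpha + (1 - alpha) * transformed[-1])
    return transformed

# on toy data, the hand-rolled recursion matches pandas'
# exponentially weighted mean when adjust=False
s = pd.Series([10.0, 12.0, 8.0, 11.0])
manual = holt_transform(s, 0.5)
via_ewm = s.ewm(alpha=0.5, adjust=False).mean()
```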

```python
# transformations
ma_transformHolt = riverFlows.apply(
    lambda col: holt_transform(col, 0.5), axis=0)
```

With axis set to 0, the .apply(...) method passes the columns one by one into the holt_transform(...) method.

```python
difference = riverFlows - riverFlows.shift()
```
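Subtracting the shifted series is first-order differencing, which pandas also exposes directly as `.diff()`; on toy data the two are identical:

```python
import pandas as pd

s = pd.Series([3.0, 5.0, 4.0, 9.0])

# subtracting the shifted series is first-order differencing...
by_shift = s - s.shift()
# ...which is what the built-in .diff() computes
by_diff = s.diff()
# the first element is NaN in both; the rest are the step changes
```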

7.5 Filtering Time-Series Data

```python
import scipy as sc
import scipy.signal

# prepare the different filters
MA_filter = [1] * 12
linear_filter = [d * (1 / 12) for d in range(0, 13)]
gaussian = [0] * 6 + list(sc.signal.gaussian(13, 2)[:7])

# convolve each column with each filter
conv_ma = riverFlows.apply(
    lambda col: sm.tsa.filters.convolution_filter(
        col, MA_filter), axis=0).dropna()

conv_linear = riverFlows.apply(
    lambda col: sm.tsa.filters.convolution_filter(
        col, linear_filter), axis=0).dropna()

conv_gauss = riverFlows.apply(
    lambda col: sm.tsa.filters.convolution_filter(
        col, gaussian), axis=0).dropna()
```
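Conceptually each of these calls is a convolution: a kernel slides over the series and produces a weighted sum at each position. A minimal numpy sketch of the moving-average case (illustrating the idea, not the statsmodels implementation):

```python
import numpy as np

# a constant kernel of ones/width gives the plain moving average
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ma = np.convolve(x, np.ones(3) / 3, mode='valid')
# each output averages a window of three consecutive values
```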

Hodrick-Prescott filter: https://www0.gsb.columbia.edu/faculty/rhodrick/prescott-hodrick1997.pdf

Christiano-Fitzgerald filter: http://www.nber.org/papers/w7257.pdf

```python
bkfilter = sm.tsa.filters.bkfilter(riverFlows, 18, 96, 12)
hpcycle, hptrend = sm.tsa.filters.hpfilter(riverFlows, 129600)
cfcycle, cftrend = sm.tsa.filters.cffilter(riverFlows,
    18, 96, False)
```

The BK filter is a band-pass filter: it removes both high and low frequencies from the time series. The .bkfilter(...) method takes the data as its first argument. The next argument is the minimum period of oscillation: 18 is recommended for monthly data, 6 for quarterly, and 1.5 for annual. The argument after that defines the maximum period of oscillation: for monthly data, 96 is recommended. The last argument determines the lead-lag of the filter (in effect, a window).
The HP filter splits the original time series into a trend component and a business-cycle component by solving a minimization problem. The .hpfilter(...) method takes the data as its first argument, followed by a smoothing parameter: conventionally 1600 for quarterly data, 6.25 for annual data, and 129600 for monthly data (our case here).
The CF filter is similar to the HP filter in that it also splits the original series into trend and business-cycle components. .cffilter(...) likewise takes the data as its first argument. As with the BK filter, the next two arguments specify the minimum and maximum periods of oscillation. The last argument, drift, specifies whether to remove a trend from the data.

The BK filter strips the amplitude from the data and makes it stationary. The HP filter's output shows that the long-term trend of the Columbia is almost constant, while the American's trend keeps changing; the American's cyclical component shows a similar pattern. The CF filter's output confirms this: the trend component of the American is more variable than the Columbia's.

7.6 Removing Trend and Seasonality

```python
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# change the font size
matplotlib.rc('xtick', labelsize=9)
matplotlib.rc('ytick', labelsize=9)
matplotlib.rc('font', size=14)

# time series tools
import statsmodels.api as sm

def period_mean(data, freq):
    '''
        Method to calculate mean for each frequency
    '''
    return np.array(
        [np.mean(data[i::freq]) for i in range(freq)])

# folder with data
data_folder = '../../Data/Chapter07/'

# colors
colors = ['#FF6600', '#000000', '#29407C', '#660000']

# riverFlows = pd.read_csv(data_folder + 'combined_flow.csv',
#     index_col=0, parse_dates=[0])
riverFlows = pd.read_csv(data_folder + 'combined_flow.csv')

# detrend the data
detrended = sm.tsa.tsatools.detrend(riverFlows,
    order=1, axis=0)

# create a data frame with the detrended data
detrended = pd.DataFrame(detrended, index=riverFlows.index,
    columns=['american_flow_d', 'columbia_flow_d'])

# join to the main dataset
riverFlows = riverFlows.join(detrended)

# calculate trend
riverFlows['american_flow_t'] = riverFlows['american_flow'] \
    - riverFlows['american_flow_d']
riverFlows['columbia_flow_t'] = riverFlows['columbia_flow'] \
    - riverFlows['columbia_flow_d']

# number of observations and frequency of seasonal component
nobs = len(riverFlows)
freq = 12   # yearly seasonality

# remove the seasonality
for col in ['american_flow_d', 'columbia_flow_d']:
    period_averages = period_mean(riverFlows[col], freq)
    riverFlows[col[:-2]+'_s'] = np.tile(period_averages,
        nobs // freq + 1)[:nobs]
    riverFlows[col[:-2]+'_r'] = np.array(riverFlows[col]) \
        - np.array(riverFlows[col[:-2]+'_s'])
```

The .detrend(...) method returns a NumPy array; in our case a two-dimensional one, since the initial dataset has two columns. So, to make it easy to join back to the initial dataset, in the next step we build a DataFrame from the detrended data. We reuse the index of the initial DataFrame, riverFlows, and, to avoid clashing with the initial columns, suffix the column names with _d.
pandas' .join(...) method merges two DataFrames (by default) on their indexes: it matches the indexes of the two DataFrames and returns the corresponding values. However, .join(...) gives you control: it lets you specify the column to join on (the key column has to be present in both datasets), and it lets you specify how to join the DataFrames. By default it is a left join: for every row of the calling DataFrame, it attaches all values from the passed-in DataFrame whose keys match. Since the indexes of the two DataFrames are identical, we only need to call the method.
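A toy illustration of the default index-based left join described above (hypothetical frames, not the river data):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
right = pd.DataFrame({'b': [3.0, 4.0]}, index=['y', 'z'])

# .join() is a left join on the index by default:
# every row of `left` survives; keys with no match get NaN,
# and rows only in `right` ('z' here) are dropped
joined = left.join(right)
```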

/* Wrong — this produces blank columns! */

/*
# create a data frame with the detrended data
# detrended = pd.DataFrame(detrended, index=riverFlows.index,
#     columns=['american_flow_d', 'columbia_flow_d'])
# with the original approach, the 'american_flow_d' and
# 'columbia_flow_d' columns come out as all NaN;
# renaming the columns in place works instead
detrended.rename(columns={'american_flow': 'american_flow_d',
    'columbia_flow': 'columbia_flow_d'}, inplace=True)
*/

NumPy's .mean(...) method computes the average of all the elements of an array; for each river, period_mean(...) thus yields one average per calendar month.

The // operator performs integer (floor) division, rounding the quotient down to the nearest integer.
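How the floor division interacts with np.tile in the seasonal code above can be checked with toy numbers:

```python
import numpy as np

# mirroring the seasonal code with toy numbers:
# 27 observations, 12-month seasonality
nobs, freq = 27, 12
reps = nobs // freq + 1          # 27 // 12 == 2, so tile 3 copies

# tile a 12-element pattern, then cut it back to nobs entries
pattern = np.arange(freq)
seasonal = np.tile(pattern, reps)[:nobs]
# seasonal[i] equals pattern[i % 12] for every i
```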

Statsmodels offers another useful method for decomposing time series: .seasonal_decompose(...).

```python
for col in riverFlows.columns:
    # seasonal decomposition of the data
    # (newer statsmodels versions use period=12 instead of freq=12)
    sd = sm.tsa.seasonal_decompose(riverFlows[col],
        model='a', freq=12)

    riverFlows[col + '_resid'] = sd.resid \
        .fillna(np.mean(sd.resid))

    riverFlows[col + '_trend'] = sd.trend \
        .fillna(np.mean(sd.trend))

    riverFlows[col + '_seas'] = sd.seasonal \
        .fillna(np.mean(sd.seasonal))
```

7.7 Forecasting the Future with ARMA and ARIMA Models

The ARMA (autoregressive moving average) model and its generalization, the ARIMA (autoregressive integrated moving average) model, are two models commonly used to forecast the future from a time series. The "integrated" part of ARIMA refers to differencing: the model's first step is to difference the series before estimating the AR and MA components.
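The differencing step (d=1) and its inversion can be sketched with numpy; integrating the differenced series back by cumulative summation recovers the original:

```python
import numpy as np

x = np.array([2.0, 5.0, 4.0, 7.0])   # toy series

# the "I" in ARIMA with d=1: difference the series first
d1 = np.diff(x)

# forecasts made on the differenced scale are integrated back
# by cumulative summation from the first observation
restored = np.concatenate(([x[0]], x[0] + np.cumsum(d1)))
```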

```python
def plot_functions(data, name):
    '''
        Method to plot the ACF and PACF functions
    '''
    # create the figure
    fig, ax = plt.subplots(2)

    # plot the functions
    sm.graphics.tsa.plot_acf(data, lags=18, ax=ax[0])
    sm.graphics.tsa.plot_pacf(data, lags=18, ax=ax[1])

    # set titles for charts
    ax[0].set_title(name.split('_')[-1])
    ax[1].set_title('')

    # set titles for rows
    ax[0].set_ylabel('ACF')
    ax[1].set_ylabel('PACF')

    # save the figure
    plt.savefig(data_folder+'/charts/'+name+'.png',
        dpi=300)

def fit_model(data, params, modelType, f, t):
    '''
        Wrapper method to fit and plot the model
    '''
    # create the model object
    # (this is the pre-0.11 statsmodels API; newer versions
    # moved it to sm.tsa.arima.model.ARIMA)
    model = sm.tsa.ARIMA(data, params)

    # fit the model
    model_res = model.fit(maxiter=600, trend='nc',
        start_params=[.1] * (params[0]+params[2]), tol=1e-06)

    # create figure
    fig, ax = plt.subplots(1, figsize=(12, 8))

    # plot the residuals of the model
    e = model.geterrors(model_res.params)
    ax.plot(e, color=colors[3])

    # define chart text
    chartText = '{0}: ({1}, {2}, {3})'.format(
        modelType.split('_')[0], params[0],
        params[1], params[2])

    # and put it on the chart
    ax.text(0.1, 0.95, chartText, transform=ax.transAxes)

    # and save the figure
    plt.savefig(data_folder+'/charts/'+modelType+'_errors.png',
        dpi=300)

    # plot the model
    plot_model(data['1950':], model_res, params,
        modelType, f, t)

    # and save the figure
    plt.savefig(data_folder+'/charts/'+modelType+'.png',
        dpi=300)

def plot_model(data, model, params, modelType, f, t):
    '''
        Method to plot the predictions of the model
    '''
    # create figure
    fig, ax = plt.subplots(1, figsize=(12, 8))

    # plot the data
    data.plot(ax=ax, color=colors[0])

    # plot the forecast
    model.plot_predict(f, t, ax=ax, plot_insample=False)

    # define chart text
    chartText = '{0}: ({1}, {2}, {3})'.format(
        modelType.split('_')[0], params[0],
        params[1], params[2])

    # and put it on the chart
    ax.text(0.1, 0.95, chartText, transform=ax.transAxes)
```

```python
# plot the ACF and PACF functions
plot_functions(riverFlows['american_flow_r'],
    'ACF_PACF_American')
plot_functions(riverFlows['columbia_flow_r'],
    'ACF_PACF_Columbia')
```

The plot_functions(...) method closely resembles the code we wrote in section 7.3, so we will not discuss it here.

```python
# fit American models
fit_model(riverFlows['american_flow_r'], (2, 0, 4),
    'ARMA_American', '1960-11-30', '1962')
fit_model(riverFlows['american_flow_r'], (2, 1, 4),
    'ARIMA_American', '1960-11-30', '1962')

# fit Columbia models
fit_model(riverFlows['columbia_flow_r'], (3, 0, 2),
    'ARMA_Columbia', '1960-09-30', '1962')
fit_model(riverFlows['columbia_flow_r'], (3, 1, 2),
    'ARIMA_Columbia', '1960-09-30', '1962')
```

The fit_model(...) method wraps all the steps needed to build a model. Its first argument is the data to estimate the model on. The params argument defines the model's parameters. The modelType argument decorates the chart when we call the plot_model(...) method; the f (from) and t (to) arguments are also passed through to plot_model(...).

The ARIMA model's forecast differs little from the ARMA model's: the forecast follows the observed data at first and then flattens out:

The AR and MA orders read off the ACF and PACF should be treated only as starting points. If the model performs well, keep them; otherwise, fine-tune.

McKinney, Perktold, and Seabold have an excellent presentation/tutorial on time series; it is worth a look: http://conference.scipy.org/scipy2011/slides/mckinney_time_series.pdf


posted @ 2020-03-28 15:26 邀月