Time Series¶

from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))

%matplotlib inline

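Date and Time Data Types and Tools¶

The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type, or simply datetime, is the most widely used:

'datetime.now(): returns the current date and time'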
from datetime import datetime
now = datetime.now()
now

datetime.datetime(2016, 6, 2, 16, 42, 8, 849855)

now.year, now.month, now.day

(2016, 6, 2)

'datetime stores both the date and the time down to the microsecond. datetime.timedelta represents the temporal difference between two datetime objects'
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

datetime.timedelta(926, 56700)

delta.days

926

delta.seconds

56700

'datetime + or - timedelta'
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)
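The days and seconds attributes shown above are two parts of one duration, which is easy to misread. A minimal sketch using only the standard library, reusing the dates from the example above (the variable names are illustrative):

```python
from datetime import datetime

start = datetime(2008, 6, 24, 8, 15)
end = datetime(2011, 1, 7)
delta = end - start

# A timedelta normalizes its value into days, seconds, and microseconds,
# with 0 <= seconds < 86400, so .seconds alone is NOT the full duration.
parts = (delta.days, delta.seconds, delta.microseconds)

# total_seconds() folds all three parts into a single float.
total_seconds = delta.total_seconds()

# Adding the difference back to the start recovers the end exactly.
recovered = start + delta
```

Use total_seconds() whenever the whole duration is needed as one number; delta.seconds only holds the part left over after whole days.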

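Types in the datetime module¶

date        Stores a calendar date (year, month, day) using the Gregorian calendar
time        Stores a time of day as hours, minutes, seconds, and microseconds
datetime    Stores both date and time
timedelta   Represents the difference between two datetime values (as days, seconds, and microseconds)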

Converting between string and datetime¶

'str or the strftime method (passing a format specification): formats datetime objects and pandas Timestamp objects as strings'
stamp = datetime(2011, 1, 3)

str(stamp)

'2011-01-03 00:00:00'

'strftime: an instance method that takes a format string'
stamp.strftime('%Y-%m-%d')

'2011-01-03'

'strptime: uses the same format codes to convert a string to a date (the inverse of strftime)'
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

datetime.datetime(2011, 1, 3, 0, 0)

datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

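datetime.strptime is the best way to parse a date with a known format. However, writing a format specification every time is tedious, especially for common date formats.
In that case, you can use the parser.parse method from the third-party dateutil package.

'dateutil.parser.parse: convenient for common date formats'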
from dateutil.parser import parse
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

'parse() can handle most human-intelligible date representations (though not Chinese-language ones)'
parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

'dayfirst=True: the day appears before the month (common in many locales outside the US)'
parse('6/12/2011', dayfirst=True)

datetime.datetime(2011, 12, 6, 0, 0)

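pandas is generally oriented toward working with arrays of dates, whether used as an axis index or as a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats such as ISO 8601 are parsed very quickly.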

datestrs

['7/6/2011', '8/6/2011']

pd.to_datetime(datestrs)
# note: output changed (no '00:00:00' anymore)

DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)

'It also handles values that should be treated as missing (None, empty string, and so on):'
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)

'NaT (Not a Time) is the pandas NA value for timestamp data.'
idx[2]

NaT

pd.isnull(idx)

array([False, False,  True], dtype=bool)
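Strings such as '7/6/2011' are ambiguous between month-first and day-first readings. A hedged sketch of how an explicit format can pin the interpretation down, assuming a pandas version whose to_datetime accepts the format and errors keyword arguments (the variable names are illustrative):

```python
import pandas as pd

datestrs = ['7/6/2011', '8/6/2011', None]

# An explicit format removes the month/day ambiguity.
us_style = pd.to_datetime(datestrs, format='%m/%d/%Y')   # July 6, August 6
eu_style = pd.to_datetime(datestrs, format='%d/%m/%Y')   # June 7, June 8

# errors='coerce' turns unparseable entries into NaT instead of raising.
messy = pd.to_datetime(['2011-07-06', 'not a date'],
                       format='%Y-%m-%d', errors='coerce')
```

Missing inputs (None) and coerced failures both end up as NaT, so pd.isnull can be used to find them afterwards.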

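Table 10-2: datetime format specification (ISO C89 compatible)¶

%Y    4-digit year
%y    2-digit year
%m    2-digit month [01, 12]
%d    2-digit day [01, 31]
%H    Hour (24-hour clock) [00, 23]
%I    Hour (12-hour clock) [01, 12]
%M    2-digit minute [00, 59]
%S    Second [00, 61] (seconds 60 and 61 account for leap seconds)
%w    Weekday as an integer [0 (Sunday), 6]
%U    Week number of the year [00, 53]. Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0"
%W    Week number of the year [00, 53]. Monday is considered the first day of the week, and days before the first Monday of the year are "week 0"
%z    UTC time zone offset as +HHMM or -HHMM; empty if the object is time zone naive
%F    Shortcut for %Y-%m-%d, for example 2012-04-18
%D    Shortcut for %m/%d/%y, for example 04/18/12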
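The difference between %U and %W, and the meaning of "week 0", is easiest to see on concrete dates. A small standard-library sketch (the dates are chosen only for illustration; the %F and %D shortcuts are spelled out because their availability depends on the platform's C library):

```python
from datetime import datetime

sunday = datetime(2012, 4, 22, 16, 20)      # a Sunday

iso_date = sunday.strftime('%Y-%m-%d')      # what the %F shortcut expands to
us_date = sunday.strftime('%m/%d/%y')       # what the %D shortcut expands to
weekday_number = sunday.strftime('%w')      # '0' means Sunday
week_sunday_first = sunday.strftime('%U')   # weeks start on Sunday
week_monday_first = sunday.strftime('%W')   # weeks start on Monday

# Days before the first Sunday (%U) or first Monday (%W) fall in week 00.
new_year = datetime(2011, 1, 1)             # a Saturday
week_zero = (new_year.strftime('%U'), new_year.strftime('%W'))
```

Because April 22, 2012 is a Sunday, it already opens a new week under %U while still belonging to the previous week under %W.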

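datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names on German or French systems differ from those on English systems.

Table 10-3: Locale-specific date formatting¶

%a    Abbreviated weekday name
%A    Full weekday name
%b    Abbreviated month name
%B    Full month name
%c    Full date and time, for example 'Tue 01 May 2012 04:20:57 PM'
%p    Locale equivalent of AM or PM
%x    Locale-appropriate formatted date; for example, in the US, May 1, 2012 yields '05/01/2012'
%X    Locale-appropriate time, for example '04:24:12 PM'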

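Time Series Basics¶

'The most basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects'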
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts

2011-01-02   -1.797416
2011-01-05   -0.767166
2011-01-07    0.247757
2011-01-08    0.943761
2011-01-10   -0.754966
2011-01-12   -1.361908
dtype: float64

'Related types: datetime, Series, DatetimeIndex, "<M8[ns]", datetime64[ns], Timestamp'
type(ts)
# The book shows the type as TimeSeries; here it is Series. note: output changed to "pandas.core.series.Series"

pandas.core.series.Series

ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

ts + ts[::2]

2011-01-02   -3.594832
2011-01-05         NaN
2011-01-07    0.495513
2011-01-08         NaN
2011-01-10   -1.509931
2011-01-12         NaN
dtype: float64

'pandas stores timestamps using the NumPy datetime64 data type at nanosecond resolution:'
ts.index.dtype
# note: output changed from dtype('datetime64[ns]') to dtype('<M8[ns]')

dtype('<M8[ns]')

stamp = ts.index[0]
stamp
# note: output changed from <Timestamp: 2011-01-02 00:00:00> to Timestamp('2011-01-02 00:00:00')

Timestamp('2011-01-02 00:00:00')

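A Timestamp can be substituted anywhere you would use a datetime object. It can also store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these later.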

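Indexing, selection, subsetting¶

You can index with a date string, a datetime, or a Timestamp; the truncate instance method (with after=) gives a result equivalent to slicing.

'Select a value by passing an index label'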
stamp = ts.index[2]
ts[stamp]

0.2477565022039801

stamp

Timestamp('2011-01-07 00:00:00')

'Select a value by passing a string that can be interpreted as a date'
ts['1/10/2011']

-0.75496551855551064

ts['20110110']

-0.75496551855551064

'For longer time series, passing just a year, or a year and month, selects a slice of the data'
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01    0.320376
2000-01-02   -0.882328
2000-01-03    0.468442
2000-01-04    1.572764
2000-01-05    0.654120
2000-01-06   -0.605270
                ...   
2002-09-21   -0.170869
2002-09-22   -0.595996
2002-09-23   -0.324581
2002-09-24   -2.023415
2002-09-25   -0.372641
2002-09-26    0.064324
Freq: D, dtype: float64

longer_ts['2001']

2001-01-01   -1.055323
2001-01-02    0.517216
2001-01-03    1.844480
2001-01-04   -0.726423
2001-01-05    0.223081
2001-01-06    1.125942
                ...   
2001-12-26    0.133168
2001-12-27   -0.285898
2001-12-28    0.606074
2001-12-29   -1.102832
2001-12-30    0.744024
2001-12-31   -0.205663
Freq: D, dtype: float64

longer_ts['2001-05']

2001-05-01   -0.331558
2001-05-02    0.056968
2001-05-03    0.591785
2001-05-04   -0.351607
2001-05-05   -1.117959
2001-05-06   -3.223271
                ...   
2001-05-26   -0.100130
2001-05-27    0.948797
2001-05-28   -0.414460
2001-05-29   -0.908078
2001-05-30   -0.880546
2001-05-31    1.167413
Freq: D, dtype: float64

ts

2011-01-02   -1.797416
2011-01-05   -0.767166
2011-01-07    0.247757
2011-01-08    0.943761
2011-01-10   -0.754966
2011-01-12   -1.361908
dtype: float64

'Slicing with dates works just as it does on a regular Series'
ts[datetime(2011, 1, 7):]

2011-01-07    0.247757
2011-01-08    0.943761
2011-01-10   -0.754966
2011-01-12   -1.361908
dtype: float64

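'You can slice with timestamps that are not in the index; the requested range may extend beyond the time range covered by the index'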
ts['1/6/2011':'1/11/2011']

2011-01-07    0.247757
2011-01-08    0.943761
2011-01-10   -0.754966
dtype: float64

'x.truncate(after=): drops all entries after the given date'
ts.truncate(after='1/9/2011')

2011-01-02   -1.797416
2011-01-05   -0.767166
2011-01-07    0.247757
2011-01-08    0.943761
dtype: float64

'All of the above holds for DataFrame as well, indexing on its rows'
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']
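Since the examples above use random data, the sizes of the selections are hard to check by eye. A deterministic sketch of the same selections, assuming a pandas version that supports partial string indexing through .loc (the series name longer is illustrative):

```python
import numpy as np
import pandas as pd

longer = pd.Series(np.arange(1000),
                   index=pd.date_range('2000-01-01', periods=1000, freq='D'))

year_2001 = longer.loc['2001']        # every row in 2001
may_2001 = longer.loc['2001-05']      # every row in May 2001

# String slices include both endpoints, unlike integer slices.
window = longer.loc['2001-05-30':'2001-06-02']

# truncate keeps rows up to and including the given date.
head = longer.truncate(after='2000-01-03')
```

The first value of year_2001 is 366 because the year 2000 is a leap year, so 366 daily rows precede January 1, 2001.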

Time series with duplicate indices¶

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

'x.index.is_unique: checks whether all index values are unique.'
dup_ts.index.is_unique

False

dup_ts['1/3/2000']  # not duplicated

4

dup_ts['1/2/2000']  # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

'To aggregate data with non-unique timestamps, use groupby and pass level=0 (the only level of the index)'
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

Date ranges, Frequencies, and Shifting¶

ts

2011-01-02   -1.797416
2011-01-05   -0.767166
2011-01-07    0.247757
2011-01-08    0.943761
2011-01-10   -0.754966
2011-01-12   -1.361908
dtype: float64

'x.resample("D"): converts the time series x to a fixed daily frequency'
ts.resample('D')

2011-01-02   -1.797416
2011-01-03         NaN
2011-01-04         NaN
2011-01-05   -0.767166
2011-01-06         NaN
2011-01-07    0.247757
2011-01-08    0.943761
2011-01-09         NaN
2011-01-10   -0.754966
2011-01-11         NaN
2011-01-12   -1.361908
Freq: D, dtype: float64
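The call above reflects the older pandas API, in which resample returned the converted series directly. A hedged sketch of the same conversion, assuming a more recent pandas release in which resample returns a deferred Resampler object (the values are placeholders, not the random data used above):

```python
from datetime import datetime
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(range(6), index=dates, dtype='float64')

# asfreq() materializes the Resampler, leaving NaN on days with no observation.
daily = ts.resample('D').asfreq()

# Series.asfreq performs the same conversion without going through resample.
same = ts.asfreq('D')
```

Both results span the 11 calendar days from January 2 to January 12, with NaN on the 5 days that had no original observation.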

Generating date ranges¶

'pd.date_range(): generates a DatetimeIndex of a given length according to a particular frequency:'
index = pd.date_range('4/1/2012', '6/1/2012')
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

'By default, date_range generates daily timestamps. If you pass only start= or only end=, you must also pass periods= (the number of periods to generate)'
pd.date_range(start='4/1/2012', periods=20)

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

pd.date_range(end='6/1/2012', periods=20)

DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

'freq="BM": passing the frequency "BM" (business end of month) yields the last business day of each month'
pd.date_range('1/1/2000', '12/1/2000', freq='BM')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')

'pd.date_range(): by default preserves the time of day (if any) of the start or end timestamp'
pd.date_range('5/2/2012 12:56:31', periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

'normalize=True: generates a set of timestamps normalized to midnight'
pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

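Frequencies and Date Offsets¶

Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, such as "M" for monthly or "H" for hourly. For each base frequency there is a corresponding object known as a date offset. For example, "H" corresponds to Hour().

"Import the date offset objects"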
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour

<Hour>

four_hours = Hour(4)
four_hours

<4 * Hours>

'Pass a multiple of a base frequency to freq=, such as "4h"'
pd.date_range('1/1/2000', '1/3/2000 23:59', freq='4h')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

'Offset objects can be combined with + and -'
Hour(2) + Minute(30)

<150 * Minutes>

'The equivalent frequency string can be passed to freq='
pd.date_range('1/1/2000', periods=10, freq='1h30min')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
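To see how a compound frequency string maps to a single offset, the string can be converted explicitly. A minimal sketch, assuming the to_offset helper from pandas.tseries.frequencies and the 'h' and 'min' aliases (alias spellings have varied across pandas releases):

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# A frequency string is a multiplier times a base alias; compound strings
# such as '1h30min' collapse to one offset expressed in the smaller unit.
ninety_minutes = to_offset('1h30min')
same_offset = to_offset('90min')

rng = pd.date_range('2000-01-01', periods=4, freq='1h30min')
```

The two offsets compare equal, which is why the index above reports its frequency as 90 minutes rather than as the string that was passed in.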

Week of month dates¶

WOM is a useful frequency class that lets you get dates such as the third Friday of each month.

rng = pd.date_range('1/1/2012', '9/1/2012', freq='WOM-3FRI')
list(rng)

[Timestamp('2012-01-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', offset='WOM-3FRI')]
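As a cross-check on what WOM-3FRI means, the same dates can be computed with plain calendar arithmetic. This is only an illustrative sketch using the standard library; third_friday is a hypothetical helper, not a pandas function:

```python
from datetime import date, timedelta

def third_friday(year, month):
    """Return the third Friday of the given month."""
    first = date(year, month, 1)
    # Days from the 1st to the first Friday (Monday is weekday 0, Friday is 4).
    to_first_friday = (4 - first.weekday()) % 7
    return first + timedelta(days=to_first_friday + 14)

fridays = [third_friday(2012, month) for month in range(1, 9)]
```

The resulting list matches the eight timestamps produced by the WOM-3FRI range above.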

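Table 10-4: Base time series frequencies¶

Alias                     Offset type            Description
D                         Day                    Calendar daily
B                         BusinessDay            Business daily
H                         Hour                   Hourly
T or min                  Minute                 Minutely
S                         Second                 Secondly
L or ms                   Milli                  Millisecond (1/1,000 of a second)
U                         Micro                  Microsecond (1/1,000,000 of a second)
M                         MonthEnd               Last calendar day of the month
BM                        BusinessMonthEnd       Last business day of the month
MS                        MonthBegin             First calendar day of the month
BMS                       BusinessMonthBegin     First business day of the month
W-MON, W-TUE, ...         Week                   Weekly on the given day of the week (MON, TUE, WED, THU, FRI, SAT, or SUN)
WOM-1MON, WOM-2TUE, ...   WeekOfMonth            Weekly dates in the first, second, third, or fourth week of the month. For example, WOM-3FRI is the third Friday of each month
Q-JAN, Q-FEB, ...         QuarterEnd             Last calendar day of the last month of each quarter, for a year ending in the indicated month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BQ-JAN, BQ-FEB, ...       BusinessQuarterEnd     Last business day of the last month of each quarter, for a year ending in the indicated month
QS-JAN, QS-FEB, ...       QuarterBegin           First calendar day of the last month of each quarter, for a year ending in the indicated month
BQS-JAN, BQS-FEB, ...     BusinessQuarterBegin   First business day of the last month of each quarter, for a year ending in the indicated month
A-JAN, A-FEB, ...         YearEnd                Last calendar day of the given month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC) each year
BA-JAN, BA-FEB, ...       BusinessYearEnd        Last business day of the given month each year
AS-JAN, AS-FEB, ...       YearBegin              First calendar day of the given month each year
BAS-JAN, BAS-FEB, ...     BusinessYearBegin      First business day of the given month each year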

Shifting (leading and lagging) data with the shift method¶

ts = Series(np.random.randn(4),
            index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

2000-01-31    0.989349
2000-02-29   -0.894813
2000-03-31    1.490474
2000-04-30   -0.893033
Freq: M, dtype: float64

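'By default, shift moves the data and leaves the index unmodified. A positive value moves the data to later timestamps, leaving NaN at the start'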
ts.shift(2)

2000-01-31         NaN
2000-02-29         NaN
2000-03-31    0.989349
2000-04-30   -0.894813
Freq: M, dtype: float64

ts.shift(-2)

2000-01-31    1.490474
2000-02-29   -0.893033
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

'A common use of shift is computing percent changes in a time series or in multiple time series stored as DataFrame columns. This is expressed as:'
ts / ts.shift(1) - 1

2000-01-31         NaN
2000-02-29   -1.904446
2000-03-31   -2.665681
2000-04-30   -1.599161
Freq: M, dtype: float64

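'Because naive shifts leave the index unmodified, some data is discarded. If the frequency is known, it can be passed to shift to keep all of the data.'
'freq="M": passing a frequency advances the timestamps instead of simply moving the data.'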
ts.shift(2, freq='M')

2000-03-31    0.989349
2000-04-30   -0.894813
2000-05-31    1.490474
2000-06-30   -0.893033
Freq: M, dtype: float64

'Other frequencies can be passed too, giving you a lot of flexibility in how to lead and lag the data.'
ts.shift(3, freq='D')

2000-02-03    0.989349
2000-03-03   -0.894813
2000-04-03    1.490474
2000-05-03   -0.893033
dtype: float64

ts.shift(1, freq='3D')

2000-02-03    0.989349
2000-03-03   -0.894813
2000-04-03    1.490474
2000-05-03   -0.893033
dtype: float64

ts.shift(1, freq='90T')

2000-01-31 01:30:00    0.989349
2000-02-29 01:30:00   -0.894813
2000-03-31 01:30:00    1.490474
2000-04-30 01:30:00   -0.893033
Freq: M, dtype: float64
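The two kinds of shift are easier to compare on fixed numbers than on random data. A small sketch with made-up prices (the values and the daily frequency are assumptions chosen for readability):

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0],
                   index=pd.date_range('2000-01-01', periods=3, freq='D'))

# Naive shift: values move down one row, the index stays put,
# and the first slot becomes NaN.
lagged = prices.shift(1)
returns = prices / lagged - 1        # period-over-period change

# Shift with freq: the timestamps move and every value is kept.
moved = prices.shift(2, freq='D')
```

Here returns holds NaN, roughly 0.10, and roughly -0.10, while moved keeps all three prices on an index that starts two days later.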

Shifting dates with offsets¶

'pandas date offsets can also be used with datetime or Timestamp objects'
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()

Timestamp('2011-11-20 00:00:00')

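'Anchored offsets: the first increment rolls a date forward to the next date that fits the frequency rule (here, the end of the current month)'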
now + MonthEnd()

Timestamp('2011-11-30 00:00:00')

'Anchored offsets: each further increment then advances by one full step.'
now + MonthEnd(2)

Timestamp('2011-12-31 00:00:00')

'Anchored offsets can explicitly roll dates forward or backward with their rollforward and rollback methods'
offset = MonthEnd()
offset.rollforward(now)

Timestamp('2011-11-30 00:00:00')

offset.rollback(now)

Timestamp('2011-10-31 00:00:00')
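The behavior of an anchored offset depends on whether the starting date already sits on the anchor. A short sketch of the two cases with MonthEnd (the variable names are illustrative):

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
mid_month = pd.Timestamp('2011-11-17')
month_end = pd.Timestamp('2011-11-30')

# From a date that is not on the anchor, one increment lands on the next anchor.
first_step = mid_month + offset
# From a date already on the anchor, one increment moves a full month ahead.
next_step = month_end + offset
# rollforward and rollback leave a date unchanged if it is already on the anchor.
unchanged = offset.rollforward(month_end)
rolled_back = offset.rollback(mid_month)
```

This is the practical difference between adding the offset and calling rollforward: addition always moves, while rollforward moves only when needed.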

'A clever use of date offsets is to combine these rolling methods with groupby'
ts = Series(np.random.randn(20),
            index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts.groupby(offset.rollforward).mean()

2000-01-31    0.109940
2000-02-29   -0.304775
2000-03-31   -0.960531
dtype: float64

'An easier and faster way to do this is with resample (covered in more detail later)'
ts.resample('M', how='mean')

2000-01-31    0.109940
2000-02-29   -0.304775
2000-03-31   -0.960531
Freq: M, dtype: float64

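Time Zone Handling¶

In Python, time zone information comes from the third-party pytz library. Because pandas wraps the functionality of pytz, you can ignore its API and only need to remember the time zone names.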

import pytz
pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

tz = pytz.timezone('US/Eastern')
tz
'Methods in pandas accept either time zone names or these objects. I recommend just using the names.'

<DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

Localization and Conversion¶

'By default, time series in pandas are time zone naive.'
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = Series(np.random.randn(len(rng)), index=rng)

print(ts.index.tz)

None

'pd.date_range(tz=): a time zone can also be set when generating a date range.'
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

'x.tz_localize(): converts from naive to localized.'
ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-09 09:30:00+00:00   -0.973504
2012-03-10 09:30:00+00:00   -1.009640
2012-03-11 09:30:00+00:00    0.458644
2012-03-12 09:30:00+00:00    0.501971
2012-03-13 09:30:00+00:00    0.053265
2012-03-14 09:30:00+00:00   -0.783983
Freq: D, dtype: float64

ts_utc.index

DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

'x.tz_convert(): once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert.'
ts_utc.tz_convert('US/Eastern')

2012-03-09 04:30:00-05:00   -0.973504
2012-03-10 04:30:00-05:00   -1.009640
2012-03-11 05:30:00-04:00    0.458644
2012-03-12 05:30:00-04:00    0.501971
2012-03-13 05:30:00-04:00    0.053265
2012-03-14 05:30:00-04:00   -0.783983
Freq: D, dtype: float64

'The time series above straddles a DST transition in the US/Eastern time zone. We could localize it to US/Eastern and then convert to, say, UTC or Berlin time:'
ts_eastern = ts.tz_localize('US/Eastern')
ts_eastern.tz_convert('UTC')

2012-03-09 14:30:00+00:00   -0.973504
2012-03-10 14:30:00+00:00   -1.009640
2012-03-11 13:30:00+00:00    0.458644
2012-03-12 13:30:00+00:00    0.501971
2012-03-13 13:30:00+00:00    0.053265
2012-03-14 13:30:00+00:00   -0.783983
Freq: D, dtype: float64

ts_eastern.tz_convert('Europe/Berlin')

2012-03-09 15:30:00+01:00   -0.973504
2012-03-10 15:30:00+01:00   -1.009640
2012-03-11 14:30:00+01:00    0.458644
2012-03-12 14:30:00+01:00    0.501971
2012-03-13 14:30:00+01:00    0.053265
2012-03-14 14:30:00+01:00   -0.783983
Freq: D, dtype: float64

'tz_localize and tz_convert are also instance methods on DatetimeIndex'
ts.index.tz_localize('Asia/Shanghai')

DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq='D')
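The distinction between localizing (attaching a zone) and converting (re-expressing the same instants) can be checked directly. A hedged sketch on the same date range, assuming the 'US/Eastern' zone name is available in the installed time zone database:

```python
from datetime import timedelta
import pandas as pd

rng = pd.date_range('2012-03-09 09:30', periods=6, freq='D')
naive = pd.Series(range(6), index=rng)

utc = naive.tz_localize('UTC')            # attach a zone to naive timestamps
eastern = utc.tz_convert('US/Eastern')    # re-express the same instants

# The wall-clock offset changes across the DST switch on 2012-03-11 ...
offset_before = eastern.index[0].utcoffset()
offset_after = eastern.index[-1].utcoffset()
# ... but the underlying instants are identical.
same_instants = (utc.index == eastern.index).all()
```

tz_convert never changes which moment in time a row refers to; only the displayed wall-clock time and UTC offset differ.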

Operations with time zone-aware Timestamp objects¶

Like time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:

stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('US/Eastern')

Timestamp('2011-03-11 23:00:00-0500', tz='US/Eastern')

'A time zone can also be passed when creating the Timestamp:'
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow

Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')

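'Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the UNIX epoch (January 1, 1970). This UTC value is invariant between time zone conversions.'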
stamp_utc.value

1299902400000000000

stamp_utc.tz_convert('US/Eastern').value

1299902400000000000

'When performing time arithmetic with pandas DateOffset objects, daylight saving time transitions are respected where possible:'
'In 2012, US/Eastern switched to daylight saving time on March 11 at 02:00, so 01:30 on that day is 30 minutes before the transition.'
# 30 minutes before DST transition
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')
stamp

Timestamp('2012-03-11 01:30:00-0500', tz='US/Eastern')

stamp + Hour()

Timestamp('2012-03-11 03:30:00-0400', tz='US/Eastern')

# 90 minutes before DST transition
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp

Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

stamp + 2 * Hour()

Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

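Operations between different time zones¶

'If two time series with different time zones are combined, the result will be in UTC. Because the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion'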
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2012-03-07 09:30:00    0.691883
2012-03-08 09:30:00    0.549357
2012-03-09 09:30:00   -0.775694
2012-03-12 09:30:00    0.063923
2012-03-13 09:30:00   -0.808317
2012-03-14 09:30:00   -0.897806
2012-03-15 09:30:00   -1.893785
2012-03-16 09:30:00   -0.376115
2012-03-19 09:30:00   -0.736955
2012-03-20 09:30:00    0.555930
Freq: B, dtype: float64

ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index

DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')

result

2012-03-07 09:30:00+00:00         NaN
2012-03-08 09:30:00+00:00         NaN
2012-03-09 09:30:00+00:00   -1.551388
2012-03-12 09:30:00+00:00    0.127845
2012-03-13 09:30:00+00:00   -1.616634
2012-03-14 09:30:00+00:00   -1.795612
2012-03-15 09:30:00+00:00   -3.787569
Freq: B, dtype: float64

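Periods and Period Arithmetic¶

Periods represent time spans, such as days, months, quarters, or years. The Period class represents this data type; its constructor requires a string or integer and a frequency from Table 10-4:

'This Period object represents the full time span from January 1, 2007 to December 31, 2007, inclusive.'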
p = pd.Period(2007, freq='A-DEC')
p

Period('2007', 'A-DEC')

'Adding or subtracting an integer from a period shifts it by its frequency.'
p + 5

Period('2012', 'A-DEC')

p - 2

Period('2005', 'A-DEC')

'If two periods have the same frequency, their difference is the number of units between them'
pd.Period('2014', freq='A-DEC') - p

7
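A period is a span rather than a point, which can be made concrete by looking at its boundaries. A minimal sketch using a monthly period (monthly frequency is an assumption made here only to keep the example short; the same attributes exist for annual periods):

```python
import pandas as pd

p = pd.Period('2007-01', freq='M')       # the whole month of January 2007

later = p + 5                             # shifts by five months
span_start = p.start_time                 # first instant of the period
span_end_day = p.end_time.normalize()     # midnight of the period's last day

# Any timestamp inside the span maps back to the same period.
inside = pd.Timestamp('2007-01-15 12:00').to_period('M')
```

Integer arithmetic moves the whole span, while start_time and end_time expose the timestamps that bound it.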

'pd.period_range(): constructs regular ranges of periods.'
rng = pd.period_range('1/1/2000', '6/30/2000', freq='M')
rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')

'Note the differences from pd.date_range():'
'one returns a PeriodIndex, the other a DatetimeIndex'
'one reports dtype int64, the other dtype datetime64[ns]'
pd.date_range('1/1/2000', '6/30/2000', freq='M')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30'],
              dtype='datetime64[ns]', freq='M')

'The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:'
Series(np.random.randn(6), index=rng)

2000-01    0.556152
2000-02   -1.100281
2000-03    0.570228
2000-04   -0.912423
2000-05    1.091271
2000-06   -0.247586
Freq: M, dtype: float64

'The PeriodIndex constructor also accepts an array of strings directly:'
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='int64', freq='Q-DEC')

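Period Frequency Conversion¶

asfreq: Period and PeriodIndex objects can be converted to another frequency with their asfreq method.

'Convert an annual period into a monthly period at either the start or the end of the year.'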
p = pd.Period('2007', freq='A-DEC')
p.asfreq('M', how='start')

Period('2007-01', 'M')

p.asfreq('M', how='end')

Period('2007-12', 'M')

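You can think of Period('2007', 'A-DEC') as a cursor pointing to a span of time that is subdivided into monthly periods.
For a fiscal year ending in a month other than December, the monthly subperiods belonging to it are different.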

p = pd.Period('2007', freq='A-JUN')
p.asfreq('M', 'start')

Period('2006-07', 'M')

p.asfreq('M', 'end')

Period('2007-06', 'M')

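'When converting from high to low frequency, the superperiod is determined by where the subperiod belongs.'
'For example, with the A-JUN frequency, the month Aug-2007 is actually part of the 2008 period'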
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')

Period('2008', 'A-JUN')

'Whole PeriodIndex objects or Series can be converted in the same way:'
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2006    0.031403
2007    0.521739
2008    0.827226
2009    1.129234
Freq: A-DEC, dtype: float64

ts.asfreq('M', how='start')

2006-01    0.031403
2007-01    0.521739
2008-01    0.827226
2009-01    1.129234
Freq: M, dtype: float64

ts.asfreq('B', how='end')

2006-12-29    0.031403
2007-12-31    0.521739
2008-12-31    0.827226
2009-12-31    1.129234
Freq: B, dtype: float64

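Quarterly period frequencies¶

Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. As such, the period 2012Q4 has a different meaning depending on the fiscal year end. pandas supports all 12 possible quarterly frequencies, Q-JAN through Q-DEC:

'In a fiscal year ending in January, 2012Q4 runs from November through January'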
p = pd.Period('2012Q4', freq='Q-JAN')
p

Period('2012Q4', 'Q-JAN')

p.asfreq('D', 'start')

Period('2011-11-01', 'D')

p.asfreq('D', 'end')

Period('2012-01-31', 'D')

'This makes period arithmetic easy. For example, to get the timestamp at 4 PM on the second-to-last business day of the quarter, you could do:'
p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

Period('2012-01-30 16:00', 'T')

p4pm.to_timestamp()

Timestamp('2012-01-30 16:00:00')
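The same timestamp can also be reached by working with timestamps and business-day offsets instead of chained asfreq calls. This is an alternative sketch, not the method used above; it assumes the BDay offset from pandas.tseries.offsets:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

p = pd.Period('2012Q4', freq='Q-JAN')

quarter_start = p.asfreq('D', how='start')
quarter_end = p.asfreq('D', how='end')

# Last business day on or before the final calendar day of the quarter ...
last_bday = BDay().rollback(quarter_end.to_timestamp())
# ... then one more business day back, at 16:00.
p4pm = last_bday - BDay(1) + pd.Timedelta(hours=16)
```

Rolling back first matters when a quarter ends on a weekend: subtracting one business day directly from a Saturday or Sunday would land on the last business day rather than the second-to-last.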

'pd.period_range(): can also generate quarterly ranges.'
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = Series(np.arange(len(rng)), index=rng)
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

'Arithmetic on a quarterly range works the same way:'
new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

Converting Timestamps to Periods (and back)¶

'x.to_period(): converts Series and DataFrame objects indexed by timestamps to periods.'
rng = pd.date_range('1/1/2000', periods=3, freq='M')
ts = Series(randn(3), index=rng)
pts = ts.to_period()
ts

2000-01-31    0.681821
2000-02-29    0.166583
2000-03-31    1.783375
Freq: M, dtype: float64

pts

2000-01    0.681821
2000-02    0.166583
2000-03    1.783375
Freq: M, dtype: float64

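Because periods refer to non-overlapping time spans, a timestamp can belong to only a single period for a given frequency. The frequency of the new PeriodIndex is inferred from the timestamps by default, but you can specify any frequency you want. Having duplicate periods in the result is not a problem: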

rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = Series(randn(6), index=rng)
ts2.to_period('M')

2000-01   -0.433075
2000-01   -1.532032
2000-01    1.765854
2000-02   -1.526148
2000-02   -0.974556
2000-02    0.086508
Freq: M, dtype: float64

ts2.to_period()

2000-01-29   -0.433075
2000-01-30   -1.532032
2000-01-31    1.765854
2000-02-01   -1.526148
2000-02-02   -0.974556
2000-02-03    0.086508
Freq: D, dtype: float64

pts = ts.to_period()
pts

2000-01    0.681821
2000-02    0.166583
2000-03    1.783375
Freq: M, dtype: float64

'x.to_timestamp(): converts back to timestamps'
pts.to_timestamp(how='end')

2000-01-31    0.681821
2000-02-29    0.166583
2000-03-31    1.783375
Freq: M, dtype: float64
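Duplicate periods produced by to_period are usually a step toward aggregation. A small deterministic sketch of the round trip (the integer values stand in for the random data above):

```python
import pandas as pd

rng = pd.date_range('2000-01-29', periods=6, freq='D')
ts2 = pd.Series(range(6), index=rng)

monthly = ts2.to_period('M')                  # several days share one period
has_duplicates = not monthly.index.is_unique
per_month = monthly.groupby(level=0).sum()    # collapse the duplicates

# Converting back places each period at its start when how='start'.
back = per_month.to_timestamp(how='start')
```

After grouping, each month appears once, and to_timestamp turns the period labels back into ordinary timestamps.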

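Creating a PeriodIndex from arrays¶

'Fixed-frequency data sets are sometimes stored with time span information spread across multiple columns. For example, in this macroeconomic data set the year and quarter are in different columns.'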
data = pd.read_csv('ch08/macrodata.csv')
data.year

0      1959
1      1959
2      1959
3      1959
4      1960
5      1960
       ... 
197    2008
198    2008
199    2008
200    2009
201    2009
202    2009
Name: year, dtype: float64

data.quarter

0      1
1      2
2      3
3      4
4      1
5      2
      ..
197    2
198    3
199    4
200    1
201    2
202    3
Name: quarter, dtype: float64

'Passing these arrays to PeriodIndex together with a frequency combines them into an index for the DataFrame.'
index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')
index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='int64', length=203, freq='Q-DEC')

data.index = index
data.infl

1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
1960Q2    0.14
          ... 
2008Q2    8.53
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, dtype: float64

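Resampling and Frequency Conversion¶

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower-frequency data to a higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

'pandas objects are equipped with a resample method, the workhorse function for all frequency conversion:'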
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(rng)), index=rng)
ts.resample('M', how='mean')

2000-01-31   -0.191423
2000-02-29    0.132256
2000-03-31    0.071156
2000-04-30    0.355905
Freq: M, dtype: float64

ts.resample('M', how='mean', kind='period')

2000-01   -0.191423
2000-02    0.132256
2000-03    0.071156
2000-04    0.355905
Freq: M, dtype: float64

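Table 10-5: resample method arguments¶

Argument           Description
freq               String or DateOffset indicating the desired resampled frequency, for example 'M', '5min', or Second(15)
how='mean'         Function name or array function producing aggregated values, for example 'mean', 'ohlc', or np.max. Defaults to 'mean'. Other common values: 'first', 'last', 'median', 'ohlc', 'max', 'min'
axis=0             Axis to resample on; defaults to axis=0
fill_method=None   How to interpolate when upsampling, such as 'ffill' or 'bfill'. By default does no interpolation
closed='left'      In downsampling, which end of each interval is closed (inclusive), 'right' or 'left'. Defaults to 'left'
label='left'       In downsampling, how to label the aggregated result, with the 'right' or 'left' bin edge. For example, the 5-minute interval from 9:30 to 9:35 could be labeled 9:30 or 9:35. Defaults to 'left' (9:30 in this example)
loffset=None       Time adjustment to the bin labels, such as '-1s' / Second(-1) to shift the aggregate labels one second earlier
limit=None         When forward or backward filling, the maximum number of periods to fill
kind=None          Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the kind of index the time series has
convention=None    When resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency. Defaults to 'start'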

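Downsampling¶

When downsampling with resample, there are two things to think about:
1. Which side of each interval is closed (default: closed='left')
2. How to label each aggregated bin, with the start of the interval or the end (default: label='left')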

rng = pd.date_range('1/1/2000', periods=12, freq='T')
ts = Series(np.arange(12), index=rng)
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

'Aggregate the data into five-minute chunks by taking the sum of each group'
'The default here is closed="left", label="left"'
ts.resample('5min', how='sum')
# note: output changed (the default changed from closed='right', label='right' to closed='left', label='left')

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

'Make the right edge of each bin inclusive (closed=right).'
ts.resample('5min', how='sum', closed='right')

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

ts.resample('5min', how='sum', closed='right', label='right')

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

'loffset="-1s": shifts the result index, here by subtracting one second.'
ts.resample('5min', how='sum', loffset='-1s')

1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int32
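The how= and loffset= arguments above belong to an older pandas API. A hedged sketch of the same three results, assuming a more recent pandas release in which the aggregation is called as a method on the object returned by resample, and the label shift is done by adjusting the index afterwards:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Default for minute bins: left-closed intervals labeled by their left edge.
left = ts.resample('5min').sum()

# Right-closed intervals labeled by their right edge.
right = ts.resample('5min', closed='right', label='right').sum()

# Shifting the labels one second earlier, done on the result index.
shifted = left.copy()
shifted.index = shifted.index - pd.Timedelta(seconds=1)
```

With closed='right', the observation at 00:00 falls into the bin ending at 00:00, which is why that result has four bins instead of three.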

OHLC重采样¶

Open-High-Low-Close (OHLC) resampling

'how="ohlc"可以得到各面元的四个值：开盘价（open），收盘价（close），最大值（high），最小值（low）'
ts.resample('5min', how='ohlc')
# note: output changed because of changed defaults
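
To make the four OHLC values concrete, here is a small sketch that assumes a recent pandas release (where the resampler exposes an ohlc() method) and rebuilds the same table from generic aggregations. The name manual is just an illustration:

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(12),
               index=pd.date_range('2000-01-01', periods=12, freq='min'))

# One row per 5-minute bin with open/high/low/close columns.
bars = ts.resample('5min').ohlc()

# The same four values spelled out with generic aggregations.
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print(bars)
print((bars.values == manual.values).all())  # True
```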

Resampling with GroupBy¶

'For example, to group by month or by weekday, just pass a function that accesses those fields on the time series index:'
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(np.arange(100), index=rng)
ts.groupby(lambda x: x.month).mean()

1    15
2    45
3    75
4    95
dtype: int32

ts.groupby(lambda x: x.weekday).mean()

0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64
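
An alternative spelling, sketched here under the assumption of a recent pandas release, passes the index attributes themselves instead of a lambda. It should produce the same group means as the two calls above:

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(100),
               index=pd.date_range('2000-01-01', periods=100, freq='D'))

# Group by calendar month (1-12) of each timestamp.
by_month = ts.groupby(ts.index.month).mean()

# Group by day of week (Monday=0 ... Sunday=6).
by_weekday = ts.groupby(ts.index.weekday).mean()

print(by_month.tolist())    # [15.0, 45.0, 75.0, 95.0]
print(by_weekday.tolist())  # [47.5, 48.5, 49.5, 50.5, 51.5, 49.0, 50.0]
```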

Upsampling and interpolation¶

frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

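'Going from a low to a high frequency introduces missing values by default'
df_daily = frame.resample('D')
df_daily

'fill_method="ffill": resample fills and interpolates the same way as fillna and reindex.'
frame.resample('D', fill_method='ffill')

'limit=2: fill at most that many periods.'
frame.resample('D', fill_method='ffill', limit=2)

'Note: the new date index need not overlap with the old one at all.'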
frame.resample('W-THU', fill_method='ffill')

frame.resample('W-THU')
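
The fill_method= argument is part of the older API used in these notes. A minimal sketch of the same three steps, assuming a recent pandas release and using fixed values instead of random ones so the effect is easy to check:

```python
import numpy as np
import pandas as pd

# Two weekly rows (Wednesdays) with fixed values instead of random ones.
frame = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                     index=pd.date_range('2000-01-01', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])

daily = frame.resample('D').asfreq()          # gaps become NaN
filled = frame.resample('D').ffill()          # carry values forward
partial = frame.resample('D').ffill(limit=2)  # fill at most 2 rows per gap

print(len(daily))                        # 8 daily rows
print(daily.isna().any(axis=1).sum())    # 6 rows with missing values
print(filled.isna().any(axis=1).sum())   # 0
print(partial.isna().any(axis=1).sum())  # 4
```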

Resampling with periods¶

frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

'Downsampling: monthly -> annual'
annual_frame = frame.resample('A-DEC', how='mean')
annual_frame

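'Upsampling is a bit more involved, because you must decide which end of each interval in the new frequency receives the original value, just as with the asfreq method. --- '
'--- The convention argument defaults to "start" and can be set to "end"'
# Q-DEC: quarterly, year ending in December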
annual_frame.resample('Q-DEC', fill_method='ffill')
# note: output changed, default value changed from convention='end' to convention='start' + 'start' changed to span-like
# also the following cells

'convention="end" can be read as starting the sampling from the last "Q-DEC" quarter of 2000.'
annual_frame.resample('Q-DEC', fill_method='ffill', convention='end')
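
The role of the start/end choice can be seen on a single period. The sketch below uses Period.asfreq, whose how argument plays the analogous role to convention; the freq alias 'Y' for an annual period ending in December is an assumption about the installed pandas version:

```python
import pandas as pd

year = pd.Period('2000', freq='Y')  # annual period ending in December

# Which quarter does the annual period map to?
first_q = year.asfreq('Q-DEC', how='start')
last_q = year.asfreq('Q-DEC', how='end')

print(first_q, last_q)  # 2000Q1 2000Q4
```

With 'start' the annual value is anchored at the first quarter of the year, with 'end' at the last one.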

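Since periods refer to time spans, the rules for upsampling and downsampling are more rigid:
1) In downsampling, the target frequency must be a subperiod of the source frequency.
2) In upsampling, the target frequency must be a superperiod of the source frequency.
If these rules are not satisfied, an exception is raised. This mainly affects the quarterly, annual, and weekly frequencies. For example, the time spans defined by Q-MAR line up only with A-MAR, A-JUN, A-SEP, and A-DEC:

'Open question in these notes: I do not understand why this works.'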
annual_frame.resample('Q-MAR', fill_method='ffill')
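
One likely way to see why A-DEC data can be resampled to Q-MAR: the quarter boundaries of Q-MAR fall in March, June, September, and December, so every calendar year ends exactly on a quarter boundary. The sketch below checks this and contrasts it with Q-JAN, which has no quarter ending in December (the variable names are illustrative):

```python
import pandas as pd

# Four consecutive quarters under two fiscal-year conventions.
q_mar = pd.period_range('2000Q1', periods=4, freq='Q-MAR')
q_jan = pd.period_range('2000Q1', periods=4, freq='Q-JAN')

# Calendar month in which each quarter ends.
mar_ends = sorted(q_mar.end_time.month.tolist())
jan_ends = sorted(q_jan.end_time.month.tolist())

print(mar_ends)  # [3, 6, 9, 12]
print(jan_ends)  # [1, 4, 7, 10]
```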

Time series plotting¶

close_px_all = pd.read_csv('ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px = close_px.resample('B', fill_method='ffill')
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2292 entries, 2003-01-02 to 2011-10-14
Freq: B
Data columns (total 3 columns):
AAPL    2292 non-null float64
MSFT    2292 non-null float64
XOM     2292 non-null float64
dtypes: float64(3)
memory usage: 71.6 KB

'pandas time series plots format dates better than plain matplotlib does. Calling plot on any one column produces a simple chart:'
close_px['AAPL'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x861a9e8>

'When plot is called on a DataFrame, all of the time series are drawn on a single subplot, with a legend indicating which is which.'
close_px.ix['2009'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x8808198>

close_px['AAPL'].ix['01-2011':'03-2011'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x8b76e80>
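
The .ix indexer used in these cells was removed in pandas 1.0. A minimal sketch of the same date-string slicing with .loc, on synthetic business-day data (the prices are made up and 'AAPL' is only a placeholder column name):

```python
import numpy as np
import pandas as pd

# Synthetic business-day prices; the values are not real market data.
idx = pd.date_range('2009-01-01', '2011-12-30', freq='B')
close_px = pd.DataFrame({'AAPL': np.linspace(100.0, 400.0, len(idx))}, index=idx)

year_2009 = close_px.loc['2009']                     # all rows in 2009
q1_2011 = close_px['AAPL'].loc['2011-01':'2011-03']  # January to March 2011

print(year_2009.index.year.unique().tolist())  # [2009]
print(q1_2011.index.min(), q1_2011.index.max())
```

Either selection can then be followed by .plot() exactly as in the cells above.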

'Quarterly data is formatted with quarterly markers, which would be quite tedious to do by hand.'
appl_q = close_px['AAPL'].resample('Q-DEC', fill_method='ffill')
appl_q.ix['2009':].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x8c95588>

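Another feature of pandas time series plots: when you right-click and drag to zoom in and out, the dates are dynamically expanded or contracted and the time span shown in the plot window is reformatted. Of course, this only happens when matplotlib is used in interactive mode.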

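Moving window functions¶

A common class of array transformations for time series is statistics and other functions evaluated over a sliding window, optionally with exponentially decaying weights. I call these moving window functions, even though the class also includes functions without a fixed-length window, such as the exponentially weighted moving average. Like other statistical functions, they automatically exclude missing data.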

close_px = close_px.asfreq('B').fillna(method='ffill')

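As the figure below shows, functions like rolling_mean by default require the specified number of non-NA observations (translator's note: this is per window, i.e. how many non-NA values a window must contain). This behavior can be changed to account for missing data. In particular, the start of the time series always has fewer than a full window of observations (see the second figure below):

'AAPL price with its 250-day moving average'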
close_px.AAPL.plot()
pd.rolling_mean(close_px.AAPL, 250).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x9cf1358>

plt.figure()

<matplotlib.figure.Figure at 0x8b8af98>

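'min_periods=10: a standard deviation is reported once 10 observations are available. Since 10 < 250, the first value is the 10-day standard deviation, the second the 11-day one, and so on up to the 250-day --- '
'--- standard deviation, after which the window stops growing. Here 250 is window=250'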
appl_std250 = pd.rolling_std(close_px.AAPL, 250, min_periods=10)
appl_std250[5:12]

2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64

test = close_px.AAPL[:10]
test.std()

0.0774955195837512

test = close_px.AAPL[:12]
test.std()

0.11236776740469294

appl_std250[248:251]

2003-12-16    1.677508
2003-12-17    1.674754
2003-12-18    1.671492
Freq: B, Name: AAPL, dtype: float64

test = close_px.AAPL[:251]
print(test[0:249].std(), '\n', test[0:250].std(), '\n', test[1:251].std(), '\n', test[0:251].std())  # the last number is the 251-day standard deviation

1.6775075502552474 
 1.6747539650269512 
 1.671492294797157 
 1.672157233499834
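
The same check can be reproduced on a tiny series. This sketch assumes a recent pandas release, where the rolling functions are methods of the .rolling() object, and uses window=5 with min_periods=3 in place of 250 and 10:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float) ** 2)

# window=5, but report a value as soon as 3 observations are available.
rolled = s.rolling(5, min_periods=3).std()

print(rolled.iloc[:2].isna().all())                   # True: fewer than 3 points
print(np.isclose(rolled.iloc[2], s.iloc[:3].std()))   # True: 3-point std
print(np.isclose(rolled.iloc[3], s.iloc[:4].std()))   # True: 4-point std
print(np.isclose(rolled.iloc[6], s.iloc[2:7].std()))  # True: full 5-point window
```

The window grows from min_periods up to the full window length and then slides forward, which is the pattern seen in the AAPL numbers above.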

'AAPL 250-day daily return standard deviation'
appl_std250.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x9dcb8d0>

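'Open question in these notes: I do not fully understand this part.'
'To compute an expanding window mean, you can think of an expanding window as a special case of a window whose length equals that of the time series, but that needs only one (or more) periods --- '
'--- (translator note: if it is not set, the leading values are all empty; it should not be too large either, or the result becomes meaningless) to compute a value'
# Define expanding mean in terms of rolling_mean
expanding_mean = lambda x: pd.rolling_mean(x, len(x), min_periods=1)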
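
To make the expanding mean less mysterious, the sketch below (assuming a recent pandas release with the .rolling() and .expanding() methods) computes it three ways on a four-element series. Each result at position i is simply the mean of all values up to and including i:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Window as long as the series, but a value is produced from the first point on.
via_rolling = s.rolling(len(s), min_periods=1).mean()

# Built-in expanding window and a hand-rolled running mean for comparison.
via_expanding = s.expanding().mean()
by_hand = s.cumsum() / np.arange(1, len(s) + 1)

print(via_rolling.tolist())  # [2.0, 3.0, 4.0, 5.0]
```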

'Calling rolling_mean (and similar functions) on a DataFrame applies the transformation to every column.'
pd.rolling_mean(close_px, 60).plot(logy=True)

<matplotlib.axes._subplots.AxesSubplot at 0x9fef1d0>

plt.close('all')

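Table 10-6: Moving window and exponentially weighted functions¶

rolling_count      Number of non-NA observations in each window
rolling_sum       Moving window sum
rolling_mean       Moving window mean
rolling_median     Moving window median
rolling_var, rolling_std  Moving window variance and standard deviation, with n - 1 in the denominator
rolling_skew, rolling_kurt  Moving window skewness (third moment) and kurtosis (fourth moment)
rolling_min, rolling_max  Moving window minimum and maximum
rolling_quantile    Moving window value at a given percentile / sample quantile
rolling_corr, rolling_cov  Moving window correlation and covariance
rolling_apply      Apply a generic array function over a moving window
ewma            Exponentially weighted moving average
ewmvar, ewmstd      Exponentially weighted moving variance and standard deviation
ewmcorr, ewmcov     Exponentially weighted moving correlation and covariance

Note: bottleneck, a Python library by Keith Goodman, provides an alternative set of NaN-friendly moving window functions. It is worth a look and may be useful in your own work.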

Exponentially-weighted functions¶

'Simple moving average versus exponentially weighted moving average'
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True,
                         figsize=(12, 7))

aapl_px = close_px.AAPL['2005':'2009']

ma60 = pd.rolling_mean(aapl_px, 60, min_periods=50)
ewma60 = pd.ewma(aapl_px, span=60)

aapl_px.plot(style='k-', ax=axes[0])
ma60.plot(style='k--', ax=axes[0])
aapl_px.plot(style='k-', ax=axes[1])
ewma60.plot(style='k--', ax=axes[1])
axes[0].set_title('Simple MA')
axes[1].set_title('Exponentially-weighted MA')

<matplotlib.text.Text at 0xa126dd8>
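
What span means can be checked by hand. The sketch below assumes a recent pandas release, where pd.ewma is spelled .ewm(span=...).mean(), and relies on the default adjust=True weighting. A span corresponds to a decay alpha = 2 / (span + 1), and each value is a weighted average in which older points are discounted by powers of (1 - alpha):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0])
span = 3
alpha = 2.0 / (span + 1)  # decay implied by the span

ewma = s.ewm(span=span).mean()  # adjust=True is the default

# Recompute the last value as an explicit weighted average:
# the newest point gets weight 1, the one before (1 - alpha), and so on.
weights = (1 - alpha) ** np.arange(len(s))[::-1]
by_hand = (weights * s.values).sum() / weights.sum()

print(np.isclose(ewma.iloc[-1], by_hand))  # True
```

Because recent points carry more weight, the exponentially weighted average reacts faster to changes than the equal-weighted moving average in the top panel.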

Binary moving window functions¶

close_px
spx_px = close_px_all['SPX']

'Correlation of AAPL with the S&P 500 index: compute percent changes first, then use rolling_corr.'
spx_rets = spx_px / spx_px.shift(1) - 1
returns = close_px.pct_change()
corr = pd.rolling_corr(returns.AAPL, spx_rets, 125, min_periods=100)
corr.plot()
'Six-month AAPL return correlation with the S&P 500'

<matplotlib.axes._subplots.AxesSubplot at 0xb2be668>

'Correlation of several stocks with the S&P 500: no loop is needed; as above, just pass a Series and a DataFrame.'
corr = pd.rolling_corr(returns, spx_rets, 125, min_periods=100)
corr.plot()
'Six-month return correlations of the three stocks with the S&P 500'

<matplotlib.axes._subplots.AxesSubplot at 0xb68f3c8>
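
A small sketch of the same two calls, assuming a recent pandas release (where rolling_corr became the .rolling(...).corr(...) method) and using synthetic prices. The tickers 'SPX', 'AAA', and 'BBB' and the window sizes are placeholders:

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
idx = pd.date_range('2000-01-03', periods=60, freq='B')

# Synthetic price paths; these are not real market data.
prices = pd.DataFrame(100 + rs.randn(60, 3).cumsum(axis=0),
                      index=idx, columns=['SPX', 'AAA', 'BBB'])

returns = prices.pct_change()  # same as prices / prices.shift(1) - 1
spx_rets = returns['SPX']

# Series vs. Series, then DataFrame vs. Series (one output column per stock).
corr_one = returns['AAA'].rolling(20, min_periods=15).corr(spx_rets)
corr_all = returns[['AAA', 'BBB']].rolling(20, min_periods=15).corr(spx_rets)

# The last rolling value equals the plain correlation over the last 20 rows.
last = returns['AAA'].iloc[-20:].corr(spx_rets.iloc[-20:])
print(np.isclose(corr_one.iloc[-1], last))  # True
```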

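User-defined moving window functions¶

The rolling_apply function lets you apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while rolling_quantile computes sample quantiles, we might be interested in the percentile rank of a particular value over the sample. The scipy.stats.percentileofscore function does just this.

'Open question in these notes: I do not understand why this works.'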
from scipy.stats import percentileofscore
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = pd.rolling_apply(returns.AAPL, 250, score_at_2percent)
result.plot()

<matplotlib.axes._subplots.AxesSubplot at 0xc50ae48>
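
What the cell above computes: for each 250-day window of returns, it asks what percentage of the returns in that window are at or below 2%, so the plot shows how a 2% daily return ranks within the trailing window. The sketch below assumes a recent pandas release (.rolling(...).apply) and replaces scipy's percentileofscore with a hand-written helper, pct_at_or_below, which is a hypothetical stand-in that should correspond to kind='weak':

```python
import numpy as np
import pandas as pd

def pct_at_or_below(window, score=0.02):
    """Percentage of values in the window that are <= score."""
    return 100.0 * np.mean(window <= score)

returns = pd.Series([0.01, 0.03, -0.02, 0.05, 0.00, 0.01])

# raw=True passes each window to the function as a NumPy array.
result = returns.rolling(4).apply(pct_at_or_below, raw=True)

print(result.tolist())  # [nan, nan, nan, 50.0, 50.0, 75.0]
```

Each window is reduced to a single number, which is the only requirement rolling_apply places on the function.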

Performance and Memory Usage Notes¶

rng = pd.date_range('1/1/2000', periods=10000000, freq='10ms')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2000-01-01 00:00:00.000    0.082488
2000-01-01 00:00:00.010    0.455146
2000-01-01 00:00:00.020   -1.680831
2000-01-01 00:00:00.030   -1.254477
2000-01-01 00:00:00.040   -1.149900
2000-01-01 00:00:00.050   -0.123929
                             ...   
2000-01-02 03:46:39.940    0.314817
2000-01-02 03:46:39.950   -0.686495
2000-01-02 03:46:39.960    1.414570
2000-01-02 03:46:39.970    0.806492
2000-01-02 03:46:39.980   -0.842300
2000-01-02 03:46:39.990   -1.051648
Freq: 10L, dtype: float64

ts.resample('15min', how='ohlc').info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 112 entries, 2000-01-01 00:00:00 to 2000-01-02 03:45:00
Freq: 15T
Data columns (total 4 columns):
open     112 non-null float64
high     112 non-null float64
low      112 non-null float64
close    112 non-null float64
dtypes: float64(4)
memory usage: 4.4 KB

%timeit ts.resample('15min', how='ohlc')

1 loop, best of 3: 205 ms per loop

rng = pd.date_range('1/1/2000', periods=10000000, freq='1s')
ts = Series(np.random.randn(len(rng)), index=rng)
%timeit ts.resample('15s', how='ohlc')

1 loop, best of 3: 282 ms per loop

	Colorado	Texas	New York	Ohio
2001-05-02	0.732838	-0.353291	-0.818405	0.918050
2001-05-09	1.299361	0.280320	-0.343367	-0.974747
2001-05-16	-0.331728	0.034929	0.511152	0.059924
2001-05-23	0.106564	0.177225	0.398806	2.878547
2001-05-30	0.154713	-0.132160	-1.194429	0.161802

Output of frame (Upsampling and interpolation):
	Colorado	Texas	New York	Ohio
2000-01-05	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-12	0.480313	1.707635	-0.046289	-0.624342

Output of frame.resample('D', fill_method='ffill'):
	Colorado	Texas	New York	Ohio
2000-01-05	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-06	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-07	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-08	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-09	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-10	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-11	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-12	0.480313	1.707635	-0.046289	-0.624342

Output of frame.resample('W-THU', fill_method='ffill'):
	Colorado	Texas	New York	Ohio
2000-01-06	-1.329064	-1.998497	-1.065641	-0.833572
2000-01-13	0.480313	1.707635	-0.046289	-0.624342

Output of frame[:5] (Resampling with periods):
	Colorado	Texas	New York	Ohio
2000-01	2.065492	-0.131464	-0.204343	1.125383
2000-02	0.435945	-2.495770	0.371203	-0.939451
2000-03	-0.567021	-0.072074	2.133888	-2.727733
2000-04	0.736853	0.778237	0.229404	1.124998
2000-05	0.065841	0.448254	-0.671314	-0.340491

Output of ts.resample('5min', how='ohlc'):
	open	high	low	close
2000-01-01 00:00:00	0	4	0	4
2000-01-01 00:05:00	5	9	5	9
2000-01-01 00:10:00	10	11	10	11

Output of annual_frame:
	Colorado	Texas	New York	Ohio
2000	0.324882	-0.031953	0.268689	-0.139799
2001	0.345896	-0.360383	0.209063	0.191363

Output of annual_frame.resample('Q-DEC', fill_method='ffill'):
	Colorado	Texas	New York	Ohio
2000Q1	0.324882	-0.031953	0.268689	-0.139799
2000Q2	0.324882	-0.031953	0.268689	-0.139799
2000Q3	0.324882	-0.031953	0.268689	-0.139799
2000Q4	0.324882	-0.031953	0.268689	-0.139799
2001Q1	0.345896	-0.360383	0.209063	0.191363
2001Q2	0.345896	-0.360383	0.209063	0.191363
2001Q3	0.345896	-0.360383	0.209063	0.191363
2001Q4	0.345896	-0.360383	0.209063	0.191363

she35

pandas notes: ch10: Time series