pandas笔记:ch11金融和经济数据应用

 

ch11

 

 

 

金融和经济数据应用

Financial and Economic Data Applications

In [1]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 6))
In [2]:
%matplotlib inline
In [3]:
%pwd
Out[3]:
'D:\\zwPython\\py35\\notebooks\\Python for Data Analysis'
 

数据规整化方面的话题

Data munging topics

 

时间序列以及截面对齐

Time series and cross-section alignment

In [4]:
close_px = pd.read_csv('ch11/stock_px.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('ch11/volume.csv', parse_dates=True, index_col=0)
prices = close_px.ix['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.ix['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]
In [5]:
prices
Out[5]:
 AAPLJNJSPXXOM
2011-09-06 379.74 64.64 1165.24 71.15
2011-09-07 383.93 65.43 1198.62 73.65
2011-09-08 384.14 64.95 1185.90 72.82
2011-09-09 377.48 63.64 1154.23 71.01
2011-09-12 379.94 63.59 1162.27 71.84
2011-09-13 384.62 63.61 1172.87 71.65
2011-09-14 389.30 63.73 1188.68 72.64
In [6]:
volume
Out[6]:
 AAPLJNJXOM
2011-09-06 18173500 15848300 25416300
2011-09-07 12492000 10759700 23108400
2011-09-08 14839800 15551500 22434800
2011-09-09 20171900 17008200 27969100
2011-09-12 16697300 13448200 26205800
In [7]:
'pandas可以在算术运算中自动对齐数据。在实际工作当中,这不仅能为你带来极大的自由度,而且还能提高你的工作效率。'
prices * volume
Out[7]:
 AAPLJNJSPXXOM
2011-09-06 6901204890 1024434112 NaN 1808369745
2011-09-07 4796053560 704007171 NaN 1701933660
2011-09-08 5700560772 1010069925 NaN 1633702136
2011-09-09 7614488812 1082401848 NaN 1986085791
2011-09-12 6343972162 855171038 NaN 1882624672
2011-09-13 NaN NaN NaN NaN
2011-09-14 NaN NaN NaN NaN
In [8]:
'用所有有效数据计算一个成交量加权平均价格:由于pandas会在算术运算过程中自动将数据对齐,并在sum这样的函数中排除缺失数据,所以我们只需编写'
'如下这条简洁表达式即可'
vwap = (prices * volume).sum() / volume.sum()
In [9]:
vwap
Out[9]:
AAPL    380.655181
JNJ      64.394769
SPX            NaN
XOM      72.024288
dtype: float64
In [10]:
vwap.dropna()
Out[10]:
AAPL    380.655181
JNJ      64.394769
XOM      72.024288
dtype: float64
In [11]:
'如果希望手工进行对齐,可以使用DataFrame的align方法,它返回的是一个元祖,含有两个对象的重索引版本:'
prices.align(volume, join='inner')
Out[11]:
(              AAPL    JNJ    XOM
 2011-09-06  379.74  64.64  71.15
 2011-09-07  383.93  65.43  73.65
 2011-09-08  384.14  64.95  72.82
 2011-09-09  377.48  63.64  71.01
 2011-09-12  379.94  63.59  71.84,                 AAPL       JNJ       XOM
 2011-09-06  18173500  15848300  25416300
 2011-09-07  12492000  10759700  23108400
 2011-09-08  14839800  15551500  22434800
 2011-09-09  20171900  17008200  27969100
 2011-09-12  16697300  13448200  26205800)
In [12]:
'另一个不可或缺的功能是:通过索引不同的Series构建一个DataFrame'
s1 = Series(range(3), index=['a', 'b', 'c'])
s2 = Series(range(4), index=['d', 'b', 'c', 'e'])
s3 = Series(range(3), index=['f', 'a', 'c'])
DataFrame({'one': s1, 'two': s2, 'three': s3})
Out[12]:
 onethreetwo
a 0 1 NaN
b 1 NaN 1
c 2 2 2
d NaN NaN 0
e NaN NaN 3
f NaN 0 NaN
In [13]:
'显式定义结果的索引'
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))
Out[13]:
 onethreetwo
f NaN 0 NaN
a 0 1 NaN
c 2 2 2
e NaN NaN 3
 

频率不同的时间序列的运算

Operations with time series of different frequencies
resample用于将数据转换到固定频率。
reindex用于使数据符合一个新索引。他们都支持插值(如向前填充)逻辑。

In [4]:
ts1 = Series(np.random.randn(3),
             index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1
Out[4]:
2012-06-13   -0.120931
2012-06-20    1.718694
2012-06-27   -2.271138
Freq: W-WED, dtype: float64
In [5]:
ts1.resample('B')
Out[5]:
2012-06-13   -0.120931
2012-06-14         NaN
2012-06-15         NaN
2012-06-18         NaN
2012-06-19         NaN
2012-06-20    1.718694
2012-06-21         NaN
2012-06-22         NaN
2012-06-25         NaN
2012-06-26         NaN
2012-06-27   -2.271138
Freq: B, dtype: float64
In [6]:
'低频到高频,填充NaN值'
ts1.resample('B', fill_method='ffill')
Out[6]:
2012-06-13   -0.120931
2012-06-14   -0.120931
2012-06-15   -0.120931
2012-06-18   -0.120931
2012-06-19   -0.120931
2012-06-20    1.718694
2012-06-21    1.718694
2012-06-22    1.718694
2012-06-25    1.718694
2012-06-26    1.718694
2012-06-27   -2.271138
Freq: B, dtype: float64
In [7]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2
Out[7]:
2012-06-12   -0.962345
2012-06-17    0.749038
2012-06-18    0.484430
2012-06-21    2.109155
2012-06-22    0.756567
2012-06-29    1.077439
dtype: float64
In [8]:
'对 ts1 采集 ts2.index 对应的数据'
ts1.reindex(ts2.index, method='ffill')
Out[8]:
2012-06-12         NaN
2012-06-17   -0.120931
2012-06-18   -0.120931
2012-06-21    1.718694
2012-06-22    1.718694
2012-06-29   -2.271138
dtype: float64
In [9]:
'ts1 + ts2 并维持index 为 ts2.index '
ts2 + ts1.reindex(ts2.index, method='ffill')
Out[9]:
2012-06-12         NaN
2012-06-17    0.628107
2012-06-18    0.363500
2012-06-21    3.827849
2012-06-22    2.475261
2012-06-29   -1.193699
dtype: float64
 

使用Period代替timestamps

Using periods instead of timestamps

In [10]:
'gdp 和 infl(通货膨胀) 的宏观经济时间序列'
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
             index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
              index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp
Out[10]:
1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64
In [11]:
infl
Out[11]:
1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64
In [12]:
'跟Timestamp的时间序列不同,Period索引的两个不同频率的时间序列之间的运算必须进行显式转换。'
'本例中已知 infl 是每年年末观测的,那么可以把它转换为 gdp 的频率'
infl_q = infl.asfreq('Q-SEP', how='end')
In [13]:
infl_q
Out[13]:
1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64
In [14]:
'现在可以对 infl 重新索引了'
infl_q.reindex(gdp.index, method='ffill')
Out[14]:
1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64
 

时间和‘最当前’的数据选取

Time of day and "as of" data selection

In [27]:
'生成一个交易日内的日期范围和时间序列索引'
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng
Out[27]:
DatetimeIndex(['2012-06-01 09:30:00', '2012-06-01 09:31:00',
               '2012-06-01 09:32:00', '2012-06-01 09:33:00',
               '2012-06-01 09:34:00', '2012-06-01 09:35:00',
               '2012-06-01 09:36:00', '2012-06-01 09:37:00',
               '2012-06-01 09:38:00', '2012-06-01 09:39:00',
               ...
               '2012-06-01 15:50:00', '2012-06-01 15:51:00',
               '2012-06-01 15:52:00', '2012-06-01 15:53:00',
               '2012-06-01 15:54:00', '2012-06-01 15:55:00',
               '2012-06-01 15:56:00', '2012-06-01 15:57:00',
               '2012-06-01 15:58:00', '2012-06-01 15:59:00'],
              dtype='datetime64[ns]', length=390, freq='T')
In [28]:
'两个工作日'
pd.offsets.BDay(2)
Out[28]:
<2 * BusinessDays>
In [29]:
'三个工作日的日期范围和时间序列索引'
[rng + pd.offsets.BDay(i) for i in range(1, 4)]
Out[29]:
[DatetimeIndex(['2012-06-04 09:30:00', '2012-06-04 09:31:00',
                '2012-06-04 09:32:00', '2012-06-04 09:33:00',
                '2012-06-04 09:34:00', '2012-06-04 09:35:00',
                '2012-06-04 09:36:00', '2012-06-04 09:37:00',
                '2012-06-04 09:38:00', '2012-06-04 09:39:00',
                ...
                '2012-06-04 15:50:00', '2012-06-04 15:51:00',
                '2012-06-04 15:52:00', '2012-06-04 15:53:00',
                '2012-06-04 15:54:00', '2012-06-04 15:55:00',
                '2012-06-04 15:56:00', '2012-06-04 15:57:00',
                '2012-06-04 15:58:00', '2012-06-04 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T'),
 DatetimeIndex(['2012-06-05 09:30:00', '2012-06-05 09:31:00',
                '2012-06-05 09:32:00', '2012-06-05 09:33:00',
                '2012-06-05 09:34:00', '2012-06-05 09:35:00',
                '2012-06-05 09:36:00', '2012-06-05 09:37:00',
                '2012-06-05 09:38:00', '2012-06-05 09:39:00',
                ...
                '2012-06-05 15:50:00', '2012-06-05 15:51:00',
                '2012-06-05 15:52:00', '2012-06-05 15:53:00',
                '2012-06-05 15:54:00', '2012-06-05 15:55:00',
                '2012-06-05 15:56:00', '2012-06-05 15:57:00',
                '2012-06-05 15:58:00', '2012-06-05 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T'),
 DatetimeIndex(['2012-06-06 09:30:00', '2012-06-06 09:31:00',
                '2012-06-06 09:32:00', '2012-06-06 09:33:00',
                '2012-06-06 09:34:00', '2012-06-06 09:35:00',
                '2012-06-06 09:36:00', '2012-06-06 09:37:00',
                '2012-06-06 09:38:00', '2012-06-06 09:39:00',
                ...
                '2012-06-06 15:50:00', '2012-06-06 15:51:00',
                '2012-06-06 15:52:00', '2012-06-06 15:53:00',
                '2012-06-06 15:54:00', '2012-06-06 15:55:00',
                '2012-06-06 15:56:00', '2012-06-06 15:57:00',
                '2012-06-06 15:58:00', '2012-06-06 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T')]
In [30]:
'5天的日期范围和时间序列索引 (这里我只看到4天的) Make a 5-day series of 9:30-15:59 values'
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
rng
Out[30]:
DatetimeIndex(['2012-06-01 09:30:00', '2012-06-01 09:31:00',
               '2012-06-01 09:32:00', '2012-06-01 09:33:00',
               '2012-06-01 09:34:00', '2012-06-01 09:35:00',
               '2012-06-01 09:36:00', '2012-06-01 09:37:00',
               '2012-06-01 09:38:00', '2012-06-01 09:39:00',
               ...
               '2012-06-06 15:50:00', '2012-06-06 15:51:00',
               '2012-06-06 15:52:00', '2012-06-06 15:53:00',
               '2012-06-06 15:54:00', '2012-06-06 15:55:00',
               '2012-06-06 15:56:00', '2012-06-06 15:57:00',
               '2012-06-06 15:58:00', '2012-06-06 15:59:00'],
              dtype='datetime64[ns]', length=1560, freq=None)
In [23]:
ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts
Out[23]:
2012-06-01 09:30:00       0
2012-06-01 09:31:00       1
2012-06-01 09:32:00       2
2012-06-01 09:33:00       3
2012-06-01 09:34:00       4
2012-06-01 09:35:00       5
                       ... 
2012-06-06 15:54:00    1554
2012-06-06 15:55:00    1555
2012-06-06 15:56:00    1556
2012-06-06 15:57:00    1557
2012-06-06 15:58:00    1558
2012-06-06 15:59:00    1559
dtype: float64
In [16]:
'datetime.time:抽出固定时间点上的值'
from datetime import time
ts[time(10, 0)]
Out[16]:
2012-06-01 10:00:00      30
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1200
dtype: float64
In [17]:
'x.at_time:实例化方法(各时间序列以及类似的DataFrame对象都有)'
ts.at_time(time(10, 0))
Out[17]:
2012-06-01 10:00:00      30
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1200
dtype: float64
In [18]:
'x.between_time:选取两个Time对象之间的值:'
ts.between_time(time(10, 0), time(10, 1))
Out[18]:
2012-06-01 10:00:00      30
2012-06-01 10:01:00      31
2012-06-04 10:00:00     420
2012-06-04 10:01:00     421
2012-06-05 10:00:00     810
2012-06-05 10:01:00     811
2012-06-06 10:00:00    1200
2012-06-06 10:01:00    1201
dtype: float64
 

有时候,在某个具体的时间上(比如说上午10点)刚好没有数据,这时,你可能希望得到上午10点之前最后出现的那个值:

In [19]:
np.random.seed(12346)
In [20]:
# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']
Out[20]:
2012-06-01 09:50:00    20
2012-06-01 09:51:00   NaN
2012-06-01 09:52:00    22
2012-06-01 09:53:00    23
2012-06-01 09:54:00   NaN
2012-06-01 09:55:00    25
2012-06-01 09:56:00   NaN
2012-06-01 09:57:00   NaN
2012-06-01 09:58:00   NaN
2012-06-01 09:59:00   NaN
2012-06-01 10:00:00   NaN
dtype: float64
In [21]:
'x.asof: 将一组Timestamp传入asof方法,就可以得到这些时间点处(或其之前最近)的有效值(非NA)'
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)
Out[21]:
2012-06-01 10:00:00      25
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1197
Freq: B, dtype: float64
 

拼接多个数据源

Splicing together data sources
第七章中,我介绍了一些合并两个相关数据集的办法。在金融或经济领域中,还有另外几个经常出现的情况:
1 在一个特定的时间点上,从一个数据源切换到另一个数据源。
2 用另一个时间序列对当前时间序列中的缺失值‘打补丁’
3 将数据中的符号(国家、资产代码等)替换为实际数据

In [31]:
'pd.concat:在特定时刻从一个时间序列切换到另一个'
data1 = DataFrame(np.ones((6, 3), dtype=float),
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Out[31]:
 abc
2012-06-12 1 1 1
2012-06-13 1 1 1
2012-06-14 1 1 1
2012-06-15 2 2 2
2012-06-16 2 2 2
2012-06-17 2 2 2
2012-06-18 2 2 2
In [32]:
'data1缺失了data2中存在的某个时间序列'
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
                  columns=['a', 'b', 'c', 'd'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Out[32]:
 abcd
2012-06-12 1 1 1 NaN
2012-06-13 1 1 1 NaN
2012-06-14 1 1 1 NaN
2012-06-15 2 2 2 2
2012-06-16 2 2 2 2
2012-06-17 2 2 2 2
2012-06-18 2 2 2 2
In [33]:
'x.combine_first:可以引入合并之前的数据,这样也扩展了“d”项的历史(注意:data2没有2012-06-12那天的数据,所以那天的数据没有填充)'
spliced_filled = spliced.combine_first(data2)
spliced_filled
Out[33]:
 abcd
2012-06-12 1 1 1 NaN
2012-06-13 1 1 1 2
2012-06-14 1 1 1 2
2012-06-15 2 2 2 2
2012-06-16 2 2 2 2
2012-06-17 2 2 2 2
2012-06-18 2 2 2 2
In [34]:
'DataFrame也有一个类似的update,它可以实现就地更新。如果只想填充空洞,则必须传入overwrite=False'
spliced.update(data2, overwrite=False)
In [35]:
spliced
Out[35]:
 abcd
2012-06-12 1 1 1 NaN
2012-06-13 1 1 1 2
2012-06-14 1 1 1 2
2012-06-15 2 2 2 2
2012-06-16 2 2 2 2
2012-06-17 2 2 2 2
2012-06-18 2 2 2 2
In [36]:
'上面所讲的这些技术都可以实现将数据中的符号替换为实际数据,但有时利用DataFrame的索引机制直接对列进行设置会更简单一些:'
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced
Out[36]:
 abcd
2012-06-12 1 1 1 NaN
2012-06-13 1 1 1 2
2012-06-14 1 1 1 2
2012-06-15 1 2 1 2
2012-06-16 1 2 1 2
2012-06-17 1 2 1 2
2012-06-18 NaN 2 NaN 2
 

收益指数和累计收益

Return indexes and cumulative returns

In [42]:
import pandas_datareader.data as web
price = web.get_data_yahoo('AAPL', '2011-01-01', '2012-07-27')['Adj Close']
price[-5:]
Out[42]:
Date
2012-07-23    79.885785
2012-07-24    79.500793
2012-07-25    76.067652
2012-07-26    76.055747
2012-07-27    77.415771
Name: Adj Close, dtype: float64
In [43]:
'计算两个时间点之间的累计百分比回报:'
price['2011-10-03'] / price['2011-3-01'] - 1
Out[43]:
0.072399889918845162
In [44]:
'利用cumprod计算出一个简单的收益指数'
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1  # Set first value to 1
ret_index
Out[44]:
Date
2011-01-03    1.000000
2011-01-04    1.005219
2011-01-05    1.013442
2011-01-06    1.012622
2011-01-07    1.019874
2011-01-10    1.039081
                ...   
2012-07-20    1.833601
2012-07-23    1.832175
2012-07-24    1.823345
2012-07-25    1.744607
2012-07-26    1.744334
2012-07-27    1.775526
Name: Adj Close, dtype: float64
In [45]:
'从收益指数中计算指定时期内的累计收益'
m_returns = ret_index.resample('BM', how='last').pct_change()
m_returns['2012']
Out[45]:
Date
2012-01-31    0.127111
2012-02-29    0.188311
2012-03-30    0.105284
2012-04-30   -0.025970
2012-05-31   -0.010702
2012-06-29    0.010853
2012-07-31    0.001986
Freq: BM, Name: Adj Close, dtype: float64
In [46]:
'就这个简单的例子而言(没有股息也没有其他需要考虑的调整),上面的结果也能通过重采样聚合(这里聚合为时期)从日百分比变化中计算得出:'
m_rets = (1 + returns).resample('M', how='prod', kind='period') - 1
m_rets['2012']
Out[46]:
Date
2012-01    0.127111
2012-02    0.188311
2012-03    0.105284
2012-04   -0.025970
2012-05   -0.010702
2012-06    0.010853
2012-07    0.001986
Freq: M, Name: Adj Close, dtype: float64
如果知道了股息的派发日和支付率,就可以将他们计入到每日总收益中。 returns[dividend_dates] += dividend_pcts
 

分组变换和分析

Group transforms and analysis

In [61]:
pd.options.display.max_rows = 20
pd.options.display.max_columns = 10
np.random.seed(12345)
In [50]:
'随机生成 1000 个股票代码'
import random; random.seed(0)
import string

N = 1000
def rands(n):
    choices = string.ascii_uppercase # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'random.choice(): 从一个非空序列中随机选择元素'
    return ''.join([random.choice(choices) for _ in range(n)]) #返回 含有5个大写字母的字符串 即为股票代码
tickers = np.array([rands(5) for _ in range(N)]) #生成1000个股票代码
In [51]:
'生成3列假想的数据,只选择部分股票(500只)作为投资组合。'
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
                'Value' : np.random.randn(M) / 200 + 0.08,
                'ShortInterest' : np.random.randn(M) / 200 - 0.02},
                index=tickers[:M])
In [52]:
'随机进行行业分类:这里以2个行业为例'
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
                    name='industry')
In [53]:
'根据行业分类进行分组,并执行分组聚合和变换'
by_industry = df.groupby(industries)
by_industry.mean()
Out[53]:
 MomentumShortInterestValue
industry   
FINANCIAL 0.030309 -0.019859 0.079973
TECH 0.030339 -0.019895 0.079944
In [54]:
by_industry.describe()
Out[54]:
  MomentumShortInterestValue
industry    
FINANCIALcount 257.000000 257.000000 257.000000
mean 0.030309 -0.019859 0.079973
std 0.005000 0.004920 0.004730
min 0.014715 -0.030405 0.065925
25% 0.027442 -0.023847 0.076857
50% 0.030183 -0.020059 0.079866
75% 0.033918 -0.016045 0.083360
max 0.044772 -0.007475 0.092568
TECHcount 243.000000 243.000000 243.000000
mean 0.030339 -0.019895 0.079944
std 0.005327 0.005152 0.005269
min 0.011273 -0.033259 0.062859
25% 0.027076 -0.023835 0.076912
50% 0.030167 -0.019313 0.079856
75% 0.034145 -0.016168 0.083262
max 0.042123 -0.008714 0.093268
In [55]:
'行业内标准化处理'
# Within-Industry Standardize
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)
In [56]:
'处理后各行业均值为0,标准差为1'
df_stand.groupby(industries).agg(['mean', 'std'])
Out[56]:
 MomentumShortInterestValue
 meanstdmeanstdmeanstd
industry      
FINANCIAL -3.784262e-16 1 -5.589995e-16 1 1.559237e-14 1
TECH 3.503370e-15 1 -1.908853e-15 1 2.999887e-15 1
In [57]:
'内置变换函数(如rank)的用法会更简洁一些:'
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])
Out[57]:
 MomentumShortInterestValue
 minmaxminmaxminmax
industry      
FINANCIAL 1 257 1 257 1 257
TECH 1 243 1 243 1 243
In [62]:
'在股票投资组合的定量分析中,‘排名和标准化’是一种很常见的变换运算组合。通过将rank和zscore连接在一起即可完成整个变换过程。'
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank()))
Out[62]:
 MomentumShortInterestValue
MYNBI -0.369884 -1.095427 -0.768221
QPMZJ 0.156490 1.593348 1.593348
PLSGQ -0.867806 -0.298753 0.668637
EJEYD 0.469469 -0.967390 -0.369884
TZIRW 0.497921 -1.138106 1.451085
ZTEJD 1.180785 0.796674 1.323048
XCVKP -0.512148 -0.882032 -1.493764
RDLNK 1.452907 0.551567 -0.753359
TUGRP 1.607574 0.156490 0.611732
OQIBZ -1.564895 1.707158 1.365727
... ... ... ...
PJDOE -1.607574 0.938937 1.095427
LHXFK -0.309415 -0.632284 0.807171
PHDVM -0.156490 -0.085358 0.184942
MOQOS 0.711316 -0.697090 -0.910484
THVMQ 1.251916 -1.337274 0.540600
JPHKQ -1.379953 0.953163 -0.412563
VACPK -0.426790 0.455242 -1.024295
MHNBS -0.384111 0.341432 -0.810900
YBNCI -1.721964 0.834077 1.547077
GXKFD 1.579122 1.678706 1.109653

500 rows × 3 columns

In [64]:
by_industry.rank() #各列的值是该列对应的排名位置
Out[64]:
 MomentumShortInterestValue
MYNBI 96 45 68
QPMZJ 133 234 234
PLSGQ 61 101 169
EJEYD 155 54 96
TZIRW 157 42 224
ZTEJD 205 178 215
XCVKP 86 60 17
RDLNK 237 170 73
TUGRP 235 133 165
OQIBZ 12 242 218
... ... ... ...
PJDOE 9 188 199
LHXFK 106 82 189
PHDVM 111 116 135
MOQOS 172 73 58
THVMQ 210 28 160
JPHKQ 25 189 93
VACPK 92 154 50
MHNBS 95 146 65
YBNCI 1 191 244
GXKFD 233 240 200

500 rows × 3 columns

 

分组因子暴露

Group factor exposures

In [65]:
from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000) #构建 3 个因子,np.random.rand返回 [0,1) 之间的均匀分布

ticker_subset = tickers.take(np.random.permutation(N)[:1000]) # tickers 包含 1000 个股票的代码

# 因子加权和以及白噪声 Weighted sum of factors plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
              index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
                    index=ticker_subset)
In [66]:
'各因子与投资组合之间的矢量相关性可能说明不了什么问题。'
factors.corrwith(port)
Out[66]:
f1    0.386231
f2   -0.668214
f3    0.215809
dtype: float64
In [67]:
'pd.ols:最小二乘回归,用来计算整个投资组合的暴露。'
pd.ols(y=port, x=factors).beta
'可以看出,由于没有给投资组合添加过多的随机噪声,原始的因子权重基本上可算是恢复出来了。'
Out[67]:
f1           0.721780
f2          -1.179640
f3           0.337076
intercept    0.459025
dtype: float64
In [68]:
'通过gruopby计算各行业的暴露量。为实现这个目的,我们先编写一个函数'
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta
In [69]:
'然后根据行业进行分组,并应用该函数,传入因子载荷的DataFrame'
by_ind = port.groupby(industries) # 通过 industries 对 port 分组
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()
Out[69]:
 f1f2f3intercept
industry    
FINANCIAL 0.756877 -1.206185 0.27380 0.489334
TECH 0.682553 -1.156423 0.40224 0.431269
 

十分位和四分卫分析(从本节往下内容略过)。

Decile and quartile analysis

In [74]:
import pandas_datareader.data as web
data = web.get_data_yahoo('SPY', '2006-01-01','2012-07-27')
data.info()
 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1655 entries, 2006-01-03 to 2012-07-27
Data columns (total 6 columns):
Open         1655 non-null float64
High         1655 non-null float64
Low          1655 non-null float64
Close        1655 non-null float64
Volume       1655 non-null int64
Adj Close    1655 non-null float64
dtypes: float64(5), int64(1)
memory usage: 90.5 KB
In [75]:
px = data['Adj Close']
returns = px.pct_change()

def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = max(index.index.get_loc(index.idxmax()) - 1, 0)
    index.values[first_loc] = 1
    return index

def trend_signal(rets, lookback, lag):
    signal = pd.rolling_sum(rets, lookback, min_periods=lookback - 5)
    return signal.shift(lag)
In [76]:
signal = trend_signal(returns, 100, 3)
trade_friday = signal.resample('W-FRI').resample('B', fill_method='ffill')
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]
In [77]:
to_index(trade_rets).plot()
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0xb364198>
 
In [78]:
vol = pd.rolling_std(returns, 250, min_periods=200) * np.sqrt(250)

def sharpe(rets, ann=250):
    return rets.mean() / rets.std()  * np.sqrt(ann)
In [79]:
cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))
 
cats: 1655, trade_rets: 1655, vol: 1655
In [80]:
trade_rets.groupby(cats).agg(sharpe)
Out[80]:
Adj Close
[0.0954, 0.16]   -0.012379
(0.16, 0.188]     0.278561
(0.188, 0.231]   -0.154232
(0.231, 0.457]    0.526424
Name: Adj Close, dtype: float64
 

More example applications

 

Signal frontier analysis

In [81]:
names = ['AAPL', 'GOOG', 'MSFT', 'DELL', 'GS', 'MS', 'BAC', 'C']
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']
px = DataFrame({n: get_px(n, None, None) for n in names})
In [82]:
#px = pd.read_csv('ch11/stock_px.csv')
In [83]:
plt.close('all')
In [84]:
px = px.asfreq('B').fillna(method='pad')
rets = px.pct_change()
((1 + rets).cumprod() - 1).plot()
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0xb635208>
 
In [85]:
def calc_mom(price, lookback, lag):
    mom_ret = price.shift(lag).pct_change(lookback)
    ranks = mom_ret.rank(axis=1, ascending=False)
    demeaned = ranks.subtract(ranks.mean(axis=1), axis=0)
    return demeaned.divide(demeaned.std(axis=1), axis=0)
In [86]:
compound = lambda x : (1 + x).prod() - 1
daily_sr = lambda x: x.mean() / x.std()

def strat_sr(prices, lb, hold):
    # Compute portfolio weights
    freq = '%dB' % hold
    port = calc_mom(prices, lb, lag=1)

    daily_rets = prices.pct_change()

    # Compute portfolio returns
    port = port.shift(1).resample(freq, how='first')
    returns = daily_rets.resample(freq, how=compound)
    port_rets = (port * returns).sum(axis=1)

    return daily_sr(port_rets) * np.sqrt(252 / hold)
In [87]:
strat_sr(px, 70, 30)
Out[87]:
-0.27845051234214518
In [88]:
from collections import defaultdict

lookbacks = range(20, 90, 5)
holdings = range(20, 90, 5)
dd = defaultdict(dict)
for lb in lookbacks:
    for hold in holdings:
        dd[lb][hold] = strat_sr(px, lb, hold)

ddf = DataFrame(dd)
ddf.index.name = 'Holding Period'
ddf.columns.name = 'Lookback Period'
In [91]:
import matplotlib.pyplot as plt

def heatmap(df, cmap=plt.cm.gray_r):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    axim = ax.imshow(df.values, cmap=cmap, interpolation='nearest')
    ax.set_xlabel(df.columns.name)
    ax.set_xticks(np.arange(len(df.columns)))
    ax.set_xticklabels(list(df.columns))
    ax.set_ylabel(df.index.name)
    ax.set_yticks(np.arange(len(df.index)))
    ax.set_yticklabels(list(df.index))
    plt.colorbar(axim)
In [92]:
heatmap(ddf)
 
 

Future contract rolling

In [93]:
pd.options.display.max_rows = 10
In [94]:
import pandas.io.data as web
# Approximate price of S&P 500 index
px = web.get_data_yahoo('SPY')['Adj Close'] * 10
px
Out[94]:
Date
2010-01-04     998.08658
2010-01-05    1000.72861
2010-01-06    1001.43318
2010-01-07    1005.66052
2010-01-08    1009.00712
                 ...    
2016-03-30    2060.20004
2016-03-31    2055.20004
2016-04-01    2069.19998
2016-04-04    2062.50000
2016-04-05    2041.90002
Name: Adj Close, dtype: float64
In [96]:
from datetime import datetime
expiry = {'ESU2': datetime(2012, 9, 21),
          'ESZ2': datetime(2012, 12, 21)}
expiry = Series(expiry).sort_values()
In [97]:
expiry
Out[97]:
ESU2   2012-09-21
ESZ2   2012-12-21
dtype: datetime64[ns]
In [98]:
np.random.seed(12347)
N = 200
walk = (np.random.randint(0, 200, size=N) - 100) * 0.25
perturb = (np.random.randint(0, 20, size=N) - 10) * 0.25
walk = walk.cumsum()

rng = pd.date_range(px.index[0], periods=len(px) + N, freq='B')
near = np.concatenate([px.values, px.values[-1] + walk])
far = np.concatenate([px.values, px.values[-1] + walk + perturb])
prices = DataFrame({'ESU2': near, 'ESZ2': far}, index=rng)
In [99]:
prices.tail()
Out[99]:
 ESU2ESZ2
2016-10-14 2071.15002 2072.90002
2016-10-17 2057.40002 2059.65002
2016-10-18 2065.40002 2067.15002
2016-10-19 2081.90002 2081.15002
2016-10-20 2061.90002 2059.65002
In [100]:
def get_roll_weights(start, expiry, items, roll_periods=5):
    # start : first date to compute weighting DataFrame
    # expiry : Series of ticker -> expiration dates
    # items : sequence of contract names

    dates = pd.date_range(start, expiry[-1], freq='B')
    weights = DataFrame(np.zeros((len(dates), len(items))),
                        index=dates, columns=items)

    prev_date = weights.index[0]
    for i, (item, ex_date) in enumerate(expiry.iteritems()):
        if i < len(expiry) - 1:
            weights.ix[prev_date:ex_date - pd.offsets.BDay(), item] = 1
            roll_rng = pd.date_range(end=ex_date - pd.offsets.BDay(),
                                     periods=roll_periods + 1, freq='B')

            decay_weights = np.linspace(0, 1, roll_periods + 1)
            weights.ix[roll_rng, item] = 1 - decay_weights
            weights.ix[roll_rng, expiry.index[i + 1]] = decay_weights
        else:
            weights.ix[prev_date:, item] = 1

        prev_date = ex_date

    return weights
In [101]:
weights = get_roll_weights('6/1/2012', expiry, prices.columns)
weights.ix['2012-09-12':'2012-09-21']
Out[101]:
 ESU2ESZ2
2012-09-12 1.0 0.0
2012-09-13 1.0 0.0
2012-09-14 0.8 0.2
2012-09-17 0.6 0.4
2012-09-18 0.4 0.6
2012-09-19 0.2 0.8
2012-09-20 0.0 1.0
2012-09-21 0.0 1.0
In [102]:
rolled_returns = (prices.pct_change() * weights).sum(1)
 

Rolling correlation and linear regression

In [103]:
aapl = web.get_data_yahoo('AAPL', '2000-01-01')['Adj Close']
msft = web.get_data_yahoo('MSFT', '2000-01-01')['Adj Close']

aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()
In [104]:
plt.figure()
Out[104]:
<matplotlib.figure.Figure at 0xb8419e8>
 
<matplotlib.figure.Figure at 0xb8419e8>
In [105]:
pd.rolling_corr(aapl_rets, msft_rets, 250).plot()
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0xb862d68>
 
In [106]:
plt.figure()
Out[106]:
<matplotlib.figure.Figure at 0xb862630>
 
<matplotlib.figure.Figure at 0xb862630>
In [107]:
model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets}, window=250)
model.beta
Out[107]:
 MSFTintercept
Date  
2000-12-28 0.429021 -0.002113
2000-12-29 0.421103 -0.001796
2001-01-02 0.420596 -0.001839
2001-01-03 0.433292 -0.001289
2001-01-04 0.432772 -0.001307
... ... ...
2016-03-30 0.590460 -0.001149
2016-03-31 0.592755 -0.001241
2016-04-01 0.593577 -0.001211
2016-04-04 0.592711 -0.001121
2016-04-05 0.592977 -0.001125

3839 rows × 2 columns

In [108]:
model.beta['MSFT'].plot()
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0xb704518>
 
In [ ]:
 
posted @ 2017-04-02 18:08  she35  阅读(262)  评论(0)    收藏  举报