金融和经济数据应用¶

Financial and Economic Data Applications

from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 6))

%matplotlib inline

%pwd

'D:\\zwPython\\py35\\notebooks\\Python for Data Analysis'

数据规整化方面的话题¶

Data munging topics

时间序列以及截面对齐¶

Time series and cross-section alignment

close_px = pd.read_csv('ch11/stock_px.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('ch11/volume.csv', parse_dates=True, index_col=0)
prices = close_px.ix['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.ix['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]

prices

volume

'pandas可以在算术运算中自动对齐数据。在实际工作当中，这不仅能为你带来极大的自由度，而且还能提高你的工作效率。'
prices * volume

'用所有有效数据计算一个成交量加权平均价格：由于pandas会在算术运算过程中自动将数据对齐，并在sum这样的函数中排除缺失数据，所以我们只需编写'
'如下这条简洁表达式即可'
vwap = (prices * volume).sum() / volume.sum()

vwap

AAPL    380.655181
JNJ      64.394769
SPX            NaN
XOM      72.024288
dtype: float64

vwap.dropna()

AAPL    380.655181
JNJ      64.394769
XOM      72.024288
dtype: float64

'如果希望手工进行对齐，可以使用DataFrame的align方法，它返回的是一个元祖，含有两个对象的重索引版本：'
prices.align(volume, join='inner')

(              AAPL    JNJ    XOM
 2011-09-06  379.74  64.64  71.15
 2011-09-07  383.93  65.43  73.65
 2011-09-08  384.14  64.95  72.82
 2011-09-09  377.48  63.64  71.01
 2011-09-12  379.94  63.59  71.84,                 AAPL       JNJ       XOM
 2011-09-06  18173500  15848300  25416300
 2011-09-07  12492000  10759700  23108400
 2011-09-08  14839800  15551500  22434800
 2011-09-09  20171900  17008200  27969100
 2011-09-12  16697300  13448200  26205800)

'另一个不可或缺的功能是：通过索引不同的Series构建一个DataFrame'
s1 = Series(range(3), index=['a', 'b', 'c'])
s2 = Series(range(4), index=['d', 'b', 'c', 'e'])
s3 = Series(range(3), index=['f', 'a', 'c'])
DataFrame({'one': s1, 'two': s2, 'three': s3})

'显式定义结果的索引'
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))

频率不同的时间序列的运算¶

Operations with time series of different frequencies
resample用于将数据转换到固定频率。
reindex用于使数据符合一个新索引。他们都支持插值（如向前填充）逻辑。

ts1 = Series(np.random.randn(3),
             index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1

2012-06-13   -0.120931
2012-06-20    1.718694
2012-06-27   -2.271138
Freq: W-WED, dtype: float64

ts1.resample('B')

2012-06-13   -0.120931
2012-06-14         NaN
2012-06-15         NaN
2012-06-18         NaN
2012-06-19         NaN
2012-06-20    1.718694
2012-06-21         NaN
2012-06-22         NaN
2012-06-25         NaN
2012-06-26         NaN
2012-06-27   -2.271138
Freq: B, dtype: float64

'低频到高频，填充NaN值'
ts1.resample('B', fill_method='ffill')

2012-06-13   -0.120931
2012-06-14   -0.120931
2012-06-15   -0.120931
2012-06-18   -0.120931
2012-06-19   -0.120931
2012-06-20    1.718694
2012-06-21    1.718694
2012-06-22    1.718694
2012-06-25    1.718694
2012-06-26    1.718694
2012-06-27   -2.271138
Freq: B, dtype: float64

dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2

2012-06-12   -0.962345
2012-06-17    0.749038
2012-06-18    0.484430
2012-06-21    2.109155
2012-06-22    0.756567
2012-06-29    1.077439
dtype: float64

'对 ts1 采集 ts2.index 对应的数据'
ts1.reindex(ts2.index, method='ffill')

2012-06-12         NaN
2012-06-17   -0.120931
2012-06-18   -0.120931
2012-06-21    1.718694
2012-06-22    1.718694
2012-06-29   -2.271138
dtype: float64

'ts1 + ts2 并维持index 为 ts2.index '
ts2 + ts1.reindex(ts2.index, method='ffill')

2012-06-12         NaN
2012-06-17    0.628107
2012-06-18    0.363500
2012-06-21    3.827849
2012-06-22    2.475261
2012-06-29   -1.193699
dtype: float64

使用Period代替timestamps¶

Using periods instead of timestamps

'gdp 和 infl(通货膨胀) 的宏观经济时间序列'
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
             index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
              index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64

infl

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

'跟Timestamp的时间序列不同，Period索引的两个不同频率的时间序列之间的运算必须进行显式转换。'
'本例中已知 infl 是每年年末观测的，那么可以把它转换为 gdp 的频率'
infl_q = infl.asfreq('Q-SEP', how='end')

infl_q

1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64

'现在可以对 infl 重新索引了'
infl_q.reindex(gdp.index, method='ffill')

1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64

时间和‘最当前’的数据选取¶

Time of day and "as of" data selection

'生成一个交易日内的日期范围和时间序列索引'
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng

DatetimeIndex(['2012-06-01 09:30:00', '2012-06-01 09:31:00',
               '2012-06-01 09:32:00', '2012-06-01 09:33:00',
               '2012-06-01 09:34:00', '2012-06-01 09:35:00',
               '2012-06-01 09:36:00', '2012-06-01 09:37:00',
               '2012-06-01 09:38:00', '2012-06-01 09:39:00',
               ...
               '2012-06-01 15:50:00', '2012-06-01 15:51:00',
               '2012-06-01 15:52:00', '2012-06-01 15:53:00',
               '2012-06-01 15:54:00', '2012-06-01 15:55:00',
               '2012-06-01 15:56:00', '2012-06-01 15:57:00',
               '2012-06-01 15:58:00', '2012-06-01 15:59:00'],
              dtype='datetime64[ns]', length=390, freq='T')

'两个工作日'
pd.offsets.BDay(2)

<2 * BusinessDays>

'三个工作日的日期范围和时间序列索引'
[rng + pd.offsets.BDay(i) for i in range(1, 4)]

[DatetimeIndex(['2012-06-04 09:30:00', '2012-06-04 09:31:00',
                '2012-06-04 09:32:00', '2012-06-04 09:33:00',
                '2012-06-04 09:34:00', '2012-06-04 09:35:00',
                '2012-06-04 09:36:00', '2012-06-04 09:37:00',
                '2012-06-04 09:38:00', '2012-06-04 09:39:00',
                ...
                '2012-06-04 15:50:00', '2012-06-04 15:51:00',
                '2012-06-04 15:52:00', '2012-06-04 15:53:00',
                '2012-06-04 15:54:00', '2012-06-04 15:55:00',
                '2012-06-04 15:56:00', '2012-06-04 15:57:00',
                '2012-06-04 15:58:00', '2012-06-04 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T'),
 DatetimeIndex(['2012-06-05 09:30:00', '2012-06-05 09:31:00',
                '2012-06-05 09:32:00', '2012-06-05 09:33:00',
                '2012-06-05 09:34:00', '2012-06-05 09:35:00',
                '2012-06-05 09:36:00', '2012-06-05 09:37:00',
                '2012-06-05 09:38:00', '2012-06-05 09:39:00',
                ...
                '2012-06-05 15:50:00', '2012-06-05 15:51:00',
                '2012-06-05 15:52:00', '2012-06-05 15:53:00',
                '2012-06-05 15:54:00', '2012-06-05 15:55:00',
                '2012-06-05 15:56:00', '2012-06-05 15:57:00',
                '2012-06-05 15:58:00', '2012-06-05 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T'),
 DatetimeIndex(['2012-06-06 09:30:00', '2012-06-06 09:31:00',
                '2012-06-06 09:32:00', '2012-06-06 09:33:00',
                '2012-06-06 09:34:00', '2012-06-06 09:35:00',
                '2012-06-06 09:36:00', '2012-06-06 09:37:00',
                '2012-06-06 09:38:00', '2012-06-06 09:39:00',
                ...
                '2012-06-06 15:50:00', '2012-06-06 15:51:00',
                '2012-06-06 15:52:00', '2012-06-06 15:53:00',
                '2012-06-06 15:54:00', '2012-06-06 15:55:00',
                '2012-06-06 15:56:00', '2012-06-06 15:57:00',
                '2012-06-06 15:58:00', '2012-06-06 15:59:00'],
               dtype='datetime64[ns]', length=390, freq='T')]

'5天的日期范围和时间序列索引 （这里我只看到4天的） Make a 5-day series of 9:30-15:59 values'
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
rng

DatetimeIndex(['2012-06-01 09:30:00', '2012-06-01 09:31:00',
               '2012-06-01 09:32:00', '2012-06-01 09:33:00',
               '2012-06-01 09:34:00', '2012-06-01 09:35:00',
               '2012-06-01 09:36:00', '2012-06-01 09:37:00',
               '2012-06-01 09:38:00', '2012-06-01 09:39:00',
               ...
               '2012-06-06 15:50:00', '2012-06-06 15:51:00',
               '2012-06-06 15:52:00', '2012-06-06 15:53:00',
               '2012-06-06 15:54:00', '2012-06-06 15:55:00',
               '2012-06-06 15:56:00', '2012-06-06 15:57:00',
               '2012-06-06 15:58:00', '2012-06-06 15:59:00'],
              dtype='datetime64[ns]', length=1560, freq=None)

ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts

2012-06-01 09:30:00       0
2012-06-01 09:31:00       1
2012-06-01 09:32:00       2
2012-06-01 09:33:00       3
2012-06-01 09:34:00       4
2012-06-01 09:35:00       5
                       ... 
2012-06-06 15:54:00    1554
2012-06-06 15:55:00    1555
2012-06-06 15:56:00    1556
2012-06-06 15:57:00    1557
2012-06-06 15:58:00    1558
2012-06-06 15:59:00    1559
dtype: float64

'datetime.time：抽出固定时间点上的值'
from datetime import time
ts[time(10, 0)]

2012-06-01 10:00:00      30
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1200
dtype: float64

'x.at_time：实例化方法（各时间序列以及类似的DataFrame对象都有）'
ts.at_time(time(10, 0))

2012-06-01 10:00:00      30
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1200
dtype: float64

'x.between_time：选取两个Time对象之间的值：'
ts.between_time(time(10, 0), time(10, 1))

2012-06-01 10:00:00      30
2012-06-01 10:01:00      31
2012-06-04 10:00:00     420
2012-06-04 10:01:00     421
2012-06-05 10:00:00     810
2012-06-05 10:01:00     811
2012-06-06 10:00:00    1200
2012-06-06 10:01:00    1201
dtype: float64

有时候，在某个具体的时间上（比如说上午10点）刚好没有数据，这时，你可能希望得到上午10点之前最后出现的那个值：

np.random.seed(12346)

# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']

2012-06-01 09:50:00    20
2012-06-01 09:51:00   NaN
2012-06-01 09:52:00    22
2012-06-01 09:53:00    23
2012-06-01 09:54:00   NaN
2012-06-01 09:55:00    25
2012-06-01 09:56:00   NaN
2012-06-01 09:57:00   NaN
2012-06-01 09:58:00   NaN
2012-06-01 09:59:00   NaN
2012-06-01 10:00:00   NaN
dtype: float64

'x.asof: 将一组Timestamp传入asof方法，就可以得到这些时间点处（或其之前最近）的有效值（非NA）'
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)

2012-06-01 10:00:00      25
2012-06-04 10:00:00     420
2012-06-05 10:00:00     810
2012-06-06 10:00:00    1197
Freq: B, dtype: float64

拼接多个数据源¶

Splicing together data sources
第七章中，我介绍了一些合并两个相关数据集的办法。在金融或经济领域中，还有另外几个经常出现的情况：
1 在一个特定的时间点上，从一个数据源切换到另一个数据源。
2 用另一个时间序列对当前时间序列中的缺失值‘打补丁’
3 将数据中的符号（国家、资产代码等）替换为实际数据

'pd.concat：在特定时刻从一个时间序列切换到另一个'
data1 = DataFrame(np.ones((6, 3), dtype=float),
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
                  columns=['a', 'b', 'c'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced

'data1缺失了data2中存在的某个时间序列'
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
                  columns=['a', 'b', 'c', 'd'],
                  index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced

'x.combine_first：可以引入合并之前的数据，这样也扩展了“d”项的历史（注意：data2没有2012-06-12那天的数据，所以那天的数据没有填充）'
spliced_filled = spliced.combine_first(data2)
spliced_filled

'DataFrame也有一个类似的update，它可以实现就地更新。如果只想填充空洞，则必须传入overwrite=False'
spliced.update(data2, overwrite=False)

spliced

'上面所讲的这些技术都可以实现将数据中的符号替换为实际数据，但有时利用DataFrame的索引机制直接对列进行设置会更简单一些：'
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced

收益指数和累计收益¶

Return indexes and cumulative returns

import pandas_datareader.data as web
price = web.get_data_yahoo('AAPL', '2011-01-01', '2012-07-27')['Adj Close']
price[-5:]

Date
2012-07-23    79.885785
2012-07-24    79.500793
2012-07-25    76.067652
2012-07-26    76.055747
2012-07-27    77.415771
Name: Adj Close, dtype: float64

'计算两个时间点之间的累计百分比回报：'
price['2011-10-03'] / price['2011-3-01'] - 1

0.072399889918845162

'利用cumprod计算出一个简单的收益指数'
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1  # Set first value to 1
ret_index

Date
2011-01-03    1.000000
2011-01-04    1.005219
2011-01-05    1.013442
2011-01-06    1.012622
2011-01-07    1.019874
2011-01-10    1.039081
                ...   
2012-07-20    1.833601
2012-07-23    1.832175
2012-07-24    1.823345
2012-07-25    1.744607
2012-07-26    1.744334
2012-07-27    1.775526
Name: Adj Close, dtype: float64

'从收益指数中计算指定时期内的累计收益'
m_returns = ret_index.resample('BM', how='last').pct_change()
m_returns['2012']

Date
2012-01-31    0.127111
2012-02-29    0.188311
2012-03-30    0.105284
2012-04-30   -0.025970
2012-05-31   -0.010702
2012-06-29    0.010853
2012-07-31    0.001986
Freq: BM, Name: Adj Close, dtype: float64

'就这个简单的例子而言（没有股息也没有其他需要考虑的调整），上面的结果也能通过重采样聚合（这里聚合为时期）从日百分比变化中计算得出：'
m_rets = (1 + returns).resample('M', how='prod', kind='period') - 1
m_rets['2012']

Date
2012-01    0.127111
2012-02    0.188311
2012-03    0.105284
2012-04   -0.025970
2012-05   -0.010702
2012-06    0.010853
2012-07    0.001986
Freq: M, Name: Adj Close, dtype: float64

分组变换和分析¶

Group transforms and analysis

pd.options.display.max_rows = 20
pd.options.display.max_columns = 10
np.random.seed(12345)

'随机生成 1000 个股票代码'
import random; random.seed(0)
import string

N = 1000
def rands(n):
    choices = string.ascii_uppercase # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'random.choice(): 从一个非空序列中随机选择元素'
    return ''.join([random.choice(choices) for _ in range(n)]) #返回 含有5个大写字母的字符串 即为股票代码
tickers = np.array([rands(5) for _ in range(N)]) #生成1000个股票代码

'生成3列假想的数据，只选择部分股票（500只）作为投资组合。'
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
                'Value' : np.random.randn(M) / 200 + 0.08,
                'ShortInterest' : np.random.randn(M) / 200 - 0.02},
                index=tickers[:M])

'随机进行行业分类：这里以2个行业为例'
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
                    name='industry')

'根据行业分类进行分组，并执行分组聚合和变换'
by_industry = df.groupby(industries)
by_industry.mean()

by_industry.describe()

'行业内标准化处理'
# Within-Industry Standardize
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)

'处理后各行业均值为0，标准差为1'
df_stand.groupby(industries).agg(['mean', 'std'])

'内置变换函数（如rank）的用法会更简洁一些：'
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])

'在股票投资组合的定量分析中，‘排名和标准化’是一种很常见的变换运算组合。通过将rank和zscore连接在一起即可完成整个变换过程。'
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank()))

by_industry.rank() #各列的值是该列对应的排名位置

分组因子暴露¶

Group factor exposures

from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000) #构建 3 个因子，np.random.rand返回 [0,1) 之间的均匀分布

ticker_subset = tickers.take(np.random.permutation(N)[:1000]) # tickers 包含 1000 个股票的代码

# 因子加权和以及白噪声 Weighted sum of factors plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
              index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
                    index=ticker_subset)

'各因子与投资组合之间的矢量相关性可能说明不了什么问题。'
factors.corrwith(port)

f1    0.386231
f2   -0.668214
f3    0.215809
dtype: float64

'pd.ols：最小二乘回归，用来计算整个投资组合的暴露。'
pd.ols(y=port, x=factors).beta
'可以看出，由于没有给投资组合添加过多的随机噪声，原始的因子权重基本上可算是恢复出来了。'

f1           0.721780
f2          -1.179640
f3           0.337076
intercept    0.459025
dtype: float64

'通过gruopby计算各行业的暴露量。为实现这个目的，我们先编写一个函数'
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta

'然后根据行业进行分组，并应用该函数，传入因子载荷的DataFrame'
by_ind = port.groupby(industries) # 通过 industries 对 port 分组
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()

十分位和四分卫分析（从本节往下内容略过）。¶

Decile and quartile analysis

import pandas_datareader.data as web
data = web.get_data_yahoo('SPY', '2006-01-01','2012-07-27')
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1655 entries, 2006-01-03 to 2012-07-27
Data columns (total 6 columns):
Open         1655 non-null float64
High         1655 non-null float64
Low          1655 non-null float64
Close        1655 non-null float64
Volume       1655 non-null int64
Adj Close    1655 non-null float64
dtypes: float64(5), int64(1)
memory usage: 90.5 KB

px = data['Adj Close']
returns = px.pct_change()

def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = max(index.index.get_loc(index.idxmax()) - 1, 0)
    index.values[first_loc] = 1
    return index

def trend_signal(rets, lookback, lag):
    signal = pd.rolling_sum(rets, lookback, min_periods=lookback - 5)
    return signal.shift(lag)

signal = trend_signal(returns, 100, 3)
trade_friday = signal.resample('W-FRI').resample('B', fill_method='ffill')
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]

to_index(trade_rets).plot()

<matplotlib.axes._subplots.AxesSubplot at 0xb364198>

vol = pd.rolling_std(returns, 250, min_periods=200) * np.sqrt(250)

def sharpe(rets, ann=250):
    return rets.mean() / rets.std()  * np.sqrt(ann)

cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))

cats: 1655, trade_rets: 1655, vol: 1655

trade_rets.groupby(cats).agg(sharpe)

Adj Close
[0.0954, 0.16]   -0.012379
(0.16, 0.188]     0.278561
(0.188, 0.231]   -0.154232
(0.231, 0.457]    0.526424
Name: Adj Close, dtype: float64

More example applications¶

Signal frontier analysis¶

names = ['AAPL', 'GOOG', 'MSFT', 'DELL', 'GS', 'MS', 'BAC', 'C']
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']
px = DataFrame({n: get_px(n, None, None) for n in names})

#px = pd.read_csv('ch11/stock_px.csv')

plt.close('all')

px = px.asfreq('B').fillna(method='pad')
rets = px.pct_change()
((1 + rets).cumprod() - 1).plot()

<matplotlib.axes._subplots.AxesSubplot at 0xb635208>

def calc_mom(price, lookback, lag):
    mom_ret = price.shift(lag).pct_change(lookback)
    ranks = mom_ret.rank(axis=1, ascending=False)
    demeaned = ranks.subtract(ranks.mean(axis=1), axis=0)
    return demeaned.divide(demeaned.std(axis=1), axis=0)

compound = lambda x : (1 + x).prod() - 1
daily_sr = lambda x: x.mean() / x.std()

def strat_sr(prices, lb, hold):
    # Compute portfolio weights
    freq = '%dB' % hold
    port = calc_mom(prices, lb, lag=1)

    daily_rets = prices.pct_change()

    # Compute portfolio returns
    port = port.shift(1).resample(freq, how='first')
    returns = daily_rets.resample(freq, how=compound)
    port_rets = (port * returns).sum(axis=1)

    return daily_sr(port_rets) * np.sqrt(252 / hold)

strat_sr(px, 70, 30)

-0.27845051234214518

from collections import defaultdict

lookbacks = range(20, 90, 5)
holdings = range(20, 90, 5)
dd = defaultdict(dict)
for lb in lookbacks:
    for hold in holdings:
        dd[lb][hold] = strat_sr(px, lb, hold)

ddf = DataFrame(dd)
ddf.index.name = 'Holding Period'
ddf.columns.name = 'Lookback Period'

import matplotlib.pyplot as plt

def heatmap(df, cmap=plt.cm.gray_r):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    axim = ax.imshow(df.values, cmap=cmap, interpolation='nearest')
    ax.set_xlabel(df.columns.name)
    ax.set_xticks(np.arange(len(df.columns)))
    ax.set_xticklabels(list(df.columns))
    ax.set_ylabel(df.index.name)
    ax.set_yticks(np.arange(len(df.index)))
    ax.set_yticklabels(list(df.index))
    plt.colorbar(axim)

heatmap(ddf)

Future contract rolling¶

pd.options.display.max_rows = 10

import pandas.io.data as web
# Approximate price of S&P 500 index
px = web.get_data_yahoo('SPY')['Adj Close'] * 10
px

Date
2010-01-04     998.08658
2010-01-05    1000.72861
2010-01-06    1001.43318
2010-01-07    1005.66052
2010-01-08    1009.00712
                 ...    
2016-03-30    2060.20004
2016-03-31    2055.20004
2016-04-01    2069.19998
2016-04-04    2062.50000
2016-04-05    2041.90002
Name: Adj Close, dtype: float64

from datetime import datetime
expiry = {'ESU2': datetime(2012, 9, 21),
          'ESZ2': datetime(2012, 12, 21)}
expiry = Series(expiry).sort_values()

expiry

ESU2   2012-09-21
ESZ2   2012-12-21
dtype: datetime64[ns]

np.random.seed(12347)
N = 200
walk = (np.random.randint(0, 200, size=N) - 100) * 0.25
perturb = (np.random.randint(0, 20, size=N) - 10) * 0.25
walk = walk.cumsum()

rng = pd.date_range(px.index[0], periods=len(px) + N, freq='B')
near = np.concatenate([px.values, px.values[-1] + walk])
far = np.concatenate([px.values, px.values[-1] + walk + perturb])
prices = DataFrame({'ESU2': near, 'ESZ2': far}, index=rng)

prices.tail()

def get_roll_weights(start, expiry, items, roll_periods=5):
    # start : first date to compute weighting DataFrame
    # expiry : Series of ticker -> expiration dates
    # items : sequence of contract names

    dates = pd.date_range(start, expiry[-1], freq='B')
    weights = DataFrame(np.zeros((len(dates), len(items))),
                        index=dates, columns=items)

    prev_date = weights.index[0]
    for i, (item, ex_date) in enumerate(expiry.iteritems()):
        if i < len(expiry) - 1:
            weights.ix[prev_date:ex_date - pd.offsets.BDay(), item] = 1
            roll_rng = pd.date_range(end=ex_date - pd.offsets.BDay(),
                                     periods=roll_periods + 1, freq='B')

            decay_weights = np.linspace(0, 1, roll_periods + 1)
            weights.ix[roll_rng, item] = 1 - decay_weights
            weights.ix[roll_rng, expiry.index[i + 1]] = decay_weights
        else:
            weights.ix[prev_date:, item] = 1

        prev_date = ex_date

    return weights

weights = get_roll_weights('6/1/2012', expiry, prices.columns)
weights.ix['2012-09-12':'2012-09-21']

rolled_returns = (prices.pct_change() * weights).sum(1)

Rolling correlation and linear regression¶

aapl = web.get_data_yahoo('AAPL', '2000-01-01')['Adj Close']
msft = web.get_data_yahoo('MSFT', '2000-01-01')['Adj Close']

aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()

plt.figure()

<matplotlib.figure.Figure at 0xb8419e8>

<matplotlib.figure.Figure at 0xb8419e8>

pd.rolling_corr(aapl_rets, msft_rets, 250).plot()

<matplotlib.axes._subplots.AxesSubplot at 0xb862d68>

plt.figure()

<matplotlib.figure.Figure at 0xb862630>

<matplotlib.figure.Figure at 0xb862630>

model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets}, window=250)
model.beta

model.beta['MSFT'].plot()

<matplotlib.axes._subplots.AxesSubplot at 0xb704518>

		Momentum	ShortInterest	Value
industry
FINANCIAL	count	257.000000	257.000000	257.000000
	mean	0.030309	-0.019859	0.079973
	std	0.005000	0.004920	0.004730
	min	0.014715	-0.030405	0.065925
	25%	0.027442	-0.023847	0.076857
	50%	0.030183	-0.020059	0.079866
	75%	0.033918	-0.016045	0.083360
	max	0.044772	-0.007475	0.092568
TECH	count	243.000000	243.000000	243.000000
	mean	0.030339	-0.019895	0.079944
	std	0.005327	0.005152	0.005269
	min	0.011273	-0.033259	0.062859
	25%	0.027076	-0.023835	0.076912
	50%	0.030167	-0.019313	0.079856
	75%	0.034145	-0.016168	0.083262
	max	0.042123	-0.008714	0.093268

	Momentum	ShortInterest	Value
MYNBI	-0.369884	-1.095427	-0.768221
QPMZJ	0.156490	1.593348	1.593348
PLSGQ	-0.867806	-0.298753	0.668637
EJEYD	0.469469	-0.967390	-0.369884
TZIRW	0.497921	-1.138106	1.451085
ZTEJD	1.180785	0.796674	1.323048
XCVKP	-0.512148	-0.882032	-1.493764
RDLNK	1.452907	0.551567	-0.753359
TUGRP	1.607574	0.156490	0.611732
OQIBZ	-1.564895	1.707158	1.365727
...	...	...	...
PJDOE	-1.607574	0.938937	1.095427
LHXFK	-0.309415	-0.632284	0.807171
PHDVM	-0.156490	-0.085358	0.184942
MOQOS	0.711316	-0.697090	-0.910484
THVMQ	1.251916	-1.337274	0.540600
JPHKQ	-1.379953	0.953163	-0.412563
VACPK	-0.426790	0.455242	-1.024295
MHNBS	-0.384111	0.341432	-0.810900
YBNCI	-1.721964	0.834077	1.547077
GXKFD	1.579122	1.678706	1.109653

	f1	f2	f3	intercept
industry
FINANCIAL	0.756877	-1.206185	0.27380	0.489334
TECH	0.682553	-1.156423	0.40224	0.431269

	ESU2	ESZ2
2016-10-14	2071.15002	2072.90002
2016-10-17	2057.40002	2059.65002
2016-10-18	2065.40002	2067.15002
2016-10-19	2081.90002	2081.15002
2016-10-20	2061.90002	2059.65002

	MSFT	intercept
Date
2000-12-28	0.429021	-0.002113
2000-12-29	0.421103	-0.001796
2001-01-02	0.420596	-0.001839
2001-01-03	0.433292	-0.001289
2001-01-04	0.432772	-0.001307
...	...	...
2016-03-30	0.590460	-0.001149
2016-03-31	0.592755	-0.001241
2016-04-01	0.593577	-0.001211
2016-04-04	0.592711	-0.001121
2016-04-05	0.592977	-0.001125

she35

pandas笔记：ch11金融和经济数据应用