pandas notes: ch11
Financial and Economic Data Applications
In [1]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 6))
In [2]:
%matplotlib inline
In [3]:
%pwd
Out[3]:
Data munging topics
Time series and cross-section alignment
In [4]:
close_px = pd.read_csv('ch11/stock_px.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('ch11/volume.csv', parse_dates=True, index_col=0)
prices = close_px.loc['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.loc['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]
In [5]:
prices
Out[5]:
In [6]:
volume
Out[6]:
In [7]:
# pandas aligns data automatically during arithmetic. In practice this buys
# considerable flexibility and makes you more productive.
prices * volume
Out[7]:
In [8]:
# Compute a volume-weighted average price from all valid data. Because pandas
# aligns the data automatically during arithmetic and functions such as sum
# exclude missing data, one concise expression is enough:
vwap = (prices * volume).sum() / volume.sum()
In [9]:
vwap
Out[9]:
In [10]:
vwap.dropna()
Out[10]:
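The alignment behavior can be checked on a tiny self-contained example (made-up tickers and numbers, modern pandas). The SPX column has no matching volume, so it falls out of the VWAP:

```python
import pandas as pd

# Prices and volumes with only partially overlapping columns --
# pandas aligns them automatically during multiplication.
prices = pd.DataFrame({'AAPL': [100.0, 102.0],
                       'XOM': [80.0, 81.0],
                       'SPX': [1200.0, 1210.0]})
volume = pd.DataFrame({'AAPL': [10.0, 30.0],
                       'XOM': [5.0, 5.0]})

# Unmatched labels (SPX) become NaN in the product; sum() skips NaN.
vwap = (prices * volume).sum() / volume.sum()
print(vwap.dropna())
```

Here AAPL's VWAP is (100·10 + 102·30) / 40 = 101.5.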
In [11]:
# For manual alignment, use DataFrame's align method, which returns a tuple
# holding reindexed versions of both objects:
prices.align(volume, join='inner')
Out[11]:
In [12]:
# Another indispensable feature: constructing a DataFrame from Series with different indexes
s1 = Series(range(3), index=['a', 'b', 'c'])
s2 = Series(range(4), index=['d', 'b', 'c', 'e'])
s3 = Series(range(3), index=['f', 'a', 'c'])
DataFrame({'one': s1, 'two': s2, 'three': s3})
Out[12]:
In [13]:
# Explicitly define the result's index
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))
Out[13]:
Operations with time series of different frequencies
resample converts data to a fixed frequency, while reindex conforms data to a new index. Both support interpolation logic such as forward filling.
In [4]:
ts1 = Series(np.random.randn(3),
index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1
Out[4]:
In [5]:
ts1.resample('B').asfreq()
Out[5]:
In [6]:
# Upsampling from low to high frequency, filling the gaps by forward fill
ts1.resample('B').ffill()
Out[6]:
In [7]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
'2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2
Out[7]:
In [8]:
# Sample ts1 at the dates in ts2.index
ts1.reindex(ts2.index, method='ffill')
Out[8]:
In [9]:
# ts1 + ts2, keeping ts2.index as the result's index
ts2 + ts1.reindex(ts2.index, method='ffill')
Out[9]:
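In modern pandas the `fill_method` argument is gone: `resample('B', fill_method='ffill')` becomes `resample('B').ffill()`. A self-contained sketch of the whole frequency-alignment workflow with made-up values:

```python
import numpy as np
import pandas as pd

ts1 = pd.Series([1.0, 2.0, 3.0],
                index=pd.date_range('2012-06-13', periods=3, freq='W-WED'))

# Upsample weekly data to business days, forward filling the gaps.
daily = ts1.resample('B').ffill()

dates = pd.DatetimeIndex(['2012-06-12', '2012-06-18', '2012-06-29'])
ts2 = pd.Series([10.0, 20.0, 30.0], index=dates)

# Pull ts1's as-of values onto ts2's irregular index, then add;
# 2012-06-12 predates ts1, so it stays NaN.
combined = ts2 + ts1.reindex(ts2.index, method='ffill')
print(combined)
```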
Using periods instead of timestamps
In [10]:
# Macroeconomic time series for GDP and inflation (infl)
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp
Out[10]:
In [11]:
infl
Out[11]:
In [12]:
# Unlike timestamp-indexed series, operations between Period-indexed series of
# different frequencies require an explicit conversion. Here we know infl was
# observed at year end, so we can convert it to gdp's frequency:
infl_q = infl.asfreq('Q-SEP', how='end')
In [13]:
infl_q
Out[13]:
In [14]:
# Now infl can be reindexed onto gdp's index
infl_q.reindex(gdp.index, method='ffill')
Out[14]:
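The same conversion, runnable as one piece (values copied from the cells above; recent pandas may warn that the annual alias `'A-DEC'` is now spelled `'Y-DEC'`):

```python
import pandas as pd

infl = pd.Series([0.025, 0.045, 0.037, 0.04],
                 index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp_index = pd.period_range('1984Q2', periods=7, freq='Q-SEP')

# Map each annual period to the Q-SEP quarter containing its end
# (Dec 1982 falls in fiscal quarter 1983Q1), then forward fill
# onto the quarterly GDP index.
infl_q = infl.asfreq('Q-SEP', how='end')
aligned = infl_q.reindex(gdp_index, method='ffill')
print(aligned)
```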
Time of day and "as of" data selection
In [27]:
# Generate a date range and time series index spanning one trading day
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng
Out[27]:
In [28]:
# Two business days
pd.offsets.BDay(2)
Out[28]:
In [29]:
# The same intraday range shifted by one, two, and three business days
[rng + pd.offsets.BDay(i) for i in range(1, 4)]
Out[29]:
In [30]:
# Make a 5-day series of 9:30-15:59 values (note: the code below actually yields only 4 days)
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
rng
Out[30]:
In [23]:
ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts
Out[23]:
In [16]:
# datetime.time: pull out the values at a fixed time of day
from datetime import time
ts[time(10, 0)]
Out[16]:
In [17]:
# at_time: an instance method available on time series and similar DataFrame objects
ts.at_time(time(10, 0))
Out[17]:
In [18]:
# between_time: select the values between two time objects:
ts.between_time(time(10, 0), time(10, 1))
Out[18]:
Sometimes there happens to be no data at a particular time of day (say 10 a.m.); in that case, you may want the last valid value observed at or before 10 a.m.:
In [19]:
np.random.seed(12346)
In [20]:
# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']
Out[20]:
In [21]:
# asof: pass a set of timestamps to get the last valid (non-NA) value at or before each one
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)
Out[21]:
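A compact, self-contained version of the `at_time`/`asof` contrast on a synthetic intraday series (`freq='min'` is the modern spelling of `'T'`):

```python
import numpy as np
import pandas as pd
from datetime import time

rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='min')
ts = pd.Series(np.arange(len(rng), dtype=float), index=rng)

# Knock out the observations from 09:56 through 10:00.
ts.loc['2012-06-01 09:56':'2012-06-01 10:00'] = np.nan

# at_time returns the (now missing) 10:00 value; asof falls back to
# the last valid observation at or before 10:00, i.e. 09:55 -> 25.0.
print(ts.at_time(time(10, 0)))
print(ts.asof(pd.Timestamp('2012-06-01 10:00')))
```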
Splicing together data sources
Chapter 7 introduced some ways to merge two related datasets. In finance and economics, a few other situations come up regularly:
1. Switching from one data source to another at a specific point in time.
2. "Patching" missing values in a time series with values from another one.
3. Replacing the symbols in the data (countries, asset tickers, etc.) with actual data.
In [31]:
# pd.concat: switch from one time series to another at a specific moment
data1 = DataFrame(np.ones((6, 3), dtype=float),
columns=['a', 'b', 'c'],
index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
columns=['a', 'b', 'c'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]])
spliced
Out[31]:
In [32]:
# data1 is missing a time series that is present in data2
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
columns=['a', 'b', 'c', 'd'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]])
spliced
Out[32]:
In [33]:
# combine_first can bring in data from before the splice, which also extends the
# history of column 'd' (note: data2 has no data for 2012-06-12, so that day is not filled)
spliced_filled = spliced.combine_first(data2)
spliced_filled
Out[33]:
In [34]:
# DataFrame has a similar method, update, which updates in place.
# To fill only the holes, you must pass overwrite=False
spliced.update(data2, overwrite=False)
In [35]:
spliced
Out[35]:
In [36]:
# All of the techniques above can substitute actual data for symbols, but sometimes
# it is simpler to set the columns directly via DataFrame indexing:
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced
Out[36]:
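Because `.ix` is gone from modern pandas, here is the same splice-then-patch sequence written with `.loc`, as a minimal self-contained sketch:

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame(np.ones((6, 3)), columns=list('abc'),
                     index=pd.date_range('2012-06-12', periods=6))
data2 = pd.DataFrame(np.ones((6, 4)) * 2, columns=list('abcd'),
                     index=pd.date_range('2012-06-13', periods=6))

# Switch data sources on 2012-06-15 (`.loc` replaces the removed `.ix`).
spliced = pd.concat([data1.loc[:'2012-06-14'], data2.loc['2012-06-15':]])

# Patch only the holes data2 can cover; existing values take priority.
filled = spliced.combine_first(data2)
print(filled)
```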
Return indexes and cumulative returns
In [42]:
import pandas_datareader.data as web
price = web.get_data_yahoo('AAPL', '2011-01-01', '2012-07-27')['Adj Close']
price[-5:]
Out[42]:
In [43]:
# Cumulative percent return between two points in time:
price['2011-10-03'] / price['2011-3-01'] - 1
Out[43]:
In [44]:
# Use cumprod to compute a simple return index
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1  # Set first value to 1
ret_index
Out[44]:
In [45]:
# Compute cumulative returns over a given period from the return index
m_returns = ret_index.resample('BM').last().pct_change()
m_returns['2012']
Out[45]:
In [46]:
# For this simple case (no dividends or other adjustments to account for), the
# same result can be computed from the daily percent changes by resampling and
# aggregating (here, aggregating to periods):
m_rets = (1 + returns).resample('M', kind='period').prod() - 1
m_rets['2012']
Out[46]:
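A runnable miniature of the return-index idea on made-up prices. It buckets by week (`'W-FRI'`) rather than month, since that alias is stable across pandas versions; the prices are a hypothetical straight line from 100 to 109:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-01-02', periods=10, freq='B')  # two trading weeks
price = pd.Series(np.linspace(100.0, 109.0, 10), index=idx)

returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1  # base the index at 1

# Weekly cumulative returns: last value of each week vs. the prior week's.
w_returns = ret_index.resample('W-FRI').last().pct_change()
print(w_returns)
```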
Group transforms and analysis
In [61]:
pd.options.display.max_rows = 20
pd.options.display.max_columns = 10
np.random.seed(12345)
In [50]:
# Randomly generate 1000 ticker symbols
import random; random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase  # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # random.choice picks a random element from a non-empty sequence
    return ''.join([random.choice(choices) for _ in range(n)])  # a string of n uppercase letters, used as a ticker
tickers = np.array([rands(5) for _ in range(N)])  # generate 1000 ticker symbols
In [51]:
# Create three columns of made-up data, using only a subset (500) of the tickers as the portfolio.
M = 500
df = DataFrame({'Momentum': np.random.randn(M) / 200 + 0.03,
                'Value': np.random.randn(M) / 200 + 0.08,
                'ShortInterest': np.random.randn(M) / 200 - 0.02},
               index=tickers[:M])
In [52]:
# Randomly assign industry classifications; here just two industries as an example
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
name='industry')
In [53]:
# Group by the industry classification and perform grouped aggregation and transformation
by_industry = df.groupby(industries)
by_industry.mean()
Out[53]:
In [54]:
by_industry.describe()
Out[54]:
In [55]:
# Within-industry standardization
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)
In [56]:
# After the transform, each industry has mean 0 and standard deviation 1
df_stand.groupby(industries).agg(['mean', 'std'])
Out[56]:
In [57]:
# Built-in transform functions such as rank are more concise to use:
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])
Out[57]:
In [62]:
# "Rank and standardize" is a common combination of transforms in quantitative
# portfolio analysis. Chaining rank and zscore together completes the whole transformation:
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank()))
Out[62]:
In [64]:
by_industry.rank()  # each column's values become their within-group rank positions
Out[64]:
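In modern pandas, `groupby(...).transform` is the idiomatic spelling of the zscore step, since it keeps the original row index; a small sketch with two made-up industries:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Momentum': rng.normal(0.03, 0.005, 6),
                   'Value': rng.normal(0.08, 0.005, 6)})
industries = pd.Series(['FIN', 'TECH'] * 3)

# transform returns a frame aligned row-for-row with df, so no
# re-sorting or index juggling is needed afterwards.
df_stand = df.groupby(industries).transform(lambda x: (x - x.mean()) / x.std())
stats = df_stand.groupby(industries).agg(['mean', 'std'])
print(stats.round(6))
```

Each group now has mean 0 and standard deviation 1 by construction.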
Group factor exposures
In [65]:
from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000)  # three factors; np.random.rand draws uniformly from [0, 1)
ticker_subset = tickers.take(np.random.permutation(N)[:1000])  # tickers holds 1000 ticker symbols
# Weighted sum of factors plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
index=ticker_subset)
In [66]:
# The vector correlations between each factor and the portfolio may not say much.
factors.corrwith(port)
Out[66]:
In [67]:
# pd.ols (least squares regression, available in older pandas) computes the
# exposures of the whole portfolio. Because little random noise was added to
# the portfolio, the original factor weights are essentially recovered.
pd.ols(y=port, x=factors).beta
Out[67]:
In [68]:
# Compute each industry's exposures via groupby. First, write a function:
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta
In [69]:
# Then group by industry and apply that function, passing in the DataFrame of factor loadings
by_ind = port.groupby(industries)  # group port by industries
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()
Out[69]:
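`pd.ols` was removed from pandas years ago. A minimal stand-in built on `np.linalg.lstsq` recovers the factor weights the same way; the data below are freshly generated, and the small 0.001 noise scale is chosen here so the recovery is unmistakable (the book used full-scale `rand(1000)` noise):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
fac1, fac2, fac3 = rng.random((3, 1000))
port = pd.Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3
                 + rng.random(1000) * 0.001)
factors = pd.DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3})

def beta_exposure(y, X):
    # Ordinary least squares with an intercept column, standing in
    # for the removed pd.ols(y=..., x=...).beta
    A = np.column_stack([X.values, np.ones(len(X))])
    coefs, *_ = np.linalg.lstsq(A, y.values, rcond=None)
    return pd.Series(coefs, index=list(X.columns) + ['intercept'])

print(beta_exposure(port, factors).round(3))
```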
Decile and quartile analysis (content from this section onward is only skimmed)
In [74]:
import pandas_datareader.data as web
data = web.get_data_yahoo('SPY', '2006-01-01','2012-07-27')
data.info()
In [75]:
px = data['Adj Close']
returns = px.pct_change()
def to_index(rets):
    index = (1 + rets).cumprod()
    first_loc = max(index.index.get_loc(index.idxmax()) - 1, 0)
    index.values[first_loc] = 1
    return index

def trend_signal(rets, lookback, lag):
    signal = rets.rolling(lookback, min_periods=lookback - 5).sum()
    return signal.shift(lag)
In [76]:
signal = trend_signal(returns, 100, 3)
trade_friday = signal.resample('W-FRI').last().resample('B').ffill()
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]
In [77]:
to_index(trade_rets).plot()
Out[77]:
In [78]:
vol = returns.rolling(250, min_periods=200).std() * np.sqrt(250)

def sharpe(rets, ann=250):
    return rets.mean() / rets.std() * np.sqrt(ann)
In [79]:
cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))
In [80]:
trade_rets.groupby(cats).agg(sharpe)
Out[80]:
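The quartile bucketing itself can be sketched without the Yahoo download (synthetic daily returns and a made-up volatility series; `sharpe` as defined above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rets = pd.Series(rng.normal(0.0005, 0.01, 1000))
vol = pd.Series(rng.random(1000))  # stand-in for the rolling volatility

def sharpe(x, ann=250):
    return x.mean() / x.std() * np.sqrt(ann)

# Bucket observations into volatility quartiles, then compute the
# annualized Sharpe ratio within each bucket.
cats = pd.qcut(vol, 4)
by_vol = rets.groupby(cats, observed=True).agg(sharpe)
print(by_vol)
```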
More example applications
Signal frontier analysis
In [81]:
names = ['AAPL', 'GOOG', 'MSFT', 'DELL', 'GS', 'MS', 'BAC', 'C']
def get_px(stock, start, end):
    return web.get_data_yahoo(stock, start, end)['Adj Close']

px = DataFrame({n: get_px(n, None, None) for n in names})
In [82]:
#px = pd.read_csv('ch11/stock_px.csv')
In [83]:
plt.close('all')
In [84]:
px = px.asfreq('B').ffill()
rets = px.pct_change()
((1 + rets).cumprod() - 1).plot()
Out[84]:
In [85]:
def calc_mom(price, lookback, lag):
    mom_ret = price.shift(lag).pct_change(lookback)
    ranks = mom_ret.rank(axis=1, ascending=False)
    demeaned = ranks.subtract(ranks.mean(axis=1), axis=0)
    return demeaned.divide(demeaned.std(axis=1), axis=0)
In [86]:
compound = lambda x: (1 + x).prod() - 1
daily_sr = lambda x: x.mean() / x.std()

def strat_sr(prices, lb, hold):
    # Compute portfolio weights
    freq = '%dB' % hold
    port = calc_mom(prices, lb, lag=1)

    daily_rets = prices.pct_change()

    # Compute portfolio returns
    port = port.shift(1).resample(freq).first()
    returns = daily_rets.resample(freq).apply(compound)
    port_rets = (port * returns).sum(axis=1)

    return daily_sr(port_rets) * np.sqrt(252 / hold)
In [87]:
strat_sr(px, 70, 30)
Out[87]:
In [88]:
from collections import defaultdict

lookbacks = range(20, 90, 5)
holdings = range(20, 90, 5)
dd = defaultdict(dict)
for lb in lookbacks:
    for hold in holdings:
        dd[lb][hold] = strat_sr(px, lb, hold)

ddf = DataFrame(dd)
ddf.index.name = 'Holding Period'
ddf.columns.name = 'Lookback Period'
In [91]:
import matplotlib.pyplot as plt

def heatmap(df, cmap=plt.cm.gray_r):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    axim = ax.imshow(df.values, cmap=cmap, interpolation='nearest')
    ax.set_xlabel(df.columns.name)
    ax.set_xticks(np.arange(len(df.columns)))
    ax.set_xticklabels(list(df.columns))
    ax.set_ylabel(df.index.name)
    ax.set_yticks(np.arange(len(df.index)))
    ax.set_yticklabels(list(df.index))
    plt.colorbar(axim)
In [92]:
heatmap(ddf)
Future contract rolling
In [93]:
pd.options.display.max_rows = 10
In [94]:
import pandas_datareader.data as web
# Approximate price of S&P 500 index
px = web.get_data_yahoo('SPY')['Adj Close'] * 10
px
Out[94]:
In [96]:
from datetime import datetime
expiry = {'ESU2': datetime(2012, 9, 21),
'ESZ2': datetime(2012, 12, 21)}
expiry = Series(expiry).sort_values()
In [97]:
expiry
Out[97]:
In [98]:
np.random.seed(12347)
N = 200
walk = (np.random.randint(0, 200, size=N) - 100) * 0.25
perturb = (np.random.randint(0, 20, size=N) - 10) * 0.25
walk = walk.cumsum()
rng = pd.date_range(px.index[0], periods=len(px) + N, freq='B')
near = np.concatenate([px.values, px.values[-1] + walk])
far = np.concatenate([px.values, px.values[-1] + walk + perturb])
prices = DataFrame({'ESU2': near, 'ESZ2': far}, index=rng)
In [99]:
prices.tail()
Out[99]:
In [100]:
def get_roll_weights(start, expiry, items, roll_periods=5):
    # start : first date to compute weighting DataFrame
    # expiry : Series of ticker -> expiration dates
    # items : sequence of contract names
    dates = pd.date_range(start, expiry.iloc[-1], freq='B')
    weights = DataFrame(np.zeros((len(dates), len(items))),
                        index=dates, columns=items)

    prev_date = weights.index[0]
    for i, (item, ex_date) in enumerate(expiry.items()):
        if i < len(expiry) - 1:
            weights.loc[prev_date:ex_date - pd.offsets.BDay(), item] = 1
            roll_rng = pd.date_range(end=ex_date - pd.offsets.BDay(),
                                     periods=roll_periods + 1, freq='B')

            decay_weights = np.linspace(0, 1, roll_periods + 1)
            weights.loc[roll_rng, item] = 1 - decay_weights
            weights.loc[roll_rng, expiry.index[i + 1]] = decay_weights
        else:
            weights.loc[prev_date:, item] = 1

        prev_date = ex_date

    return weights
In [101]:
weights = get_roll_weights('6/1/2012', expiry, prices.columns)
weights.loc['2012-09-12':'2012-09-21']
Out[101]:
In [102]:
rolled_returns = (prices.pct_change() * weights).sum(1)
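A simplified, runnable version of the roll-weight idea for exactly two contracts (the contract names come from above; the start date is arbitrary and the expiry is the ESU2 date used earlier). Weight shifts linearly from the near contract to the far one over the last `roll_periods` business days:

```python
import numpy as np
import pandas as pd

def two_contract_roll(start, expiry_date, near_name, far_name, roll_periods=5):
    # Hold the near contract fully, then hand weight over linearly
    # during the roll window ending at expiry.
    dates = pd.date_range(start, expiry_date, freq='B')
    w = pd.DataFrame(0.0, index=dates, columns=[near_name, far_name])
    w[near_name] = 1.0
    roll_rng = pd.date_range(end=expiry_date, periods=roll_periods + 1, freq='B')
    decay = np.linspace(0, 1, roll_periods + 1)
    w.loc[roll_rng, near_name] = 1 - decay
    w.loc[roll_rng, far_name] = decay
    return w

w = two_contract_roll('2012-09-01', pd.Timestamp('2012-09-21'), 'ESU2', 'ESZ2')
print(w.tail(7))
```

The weights in every row sum to 1, so the rolled return is always a full-weight blend of the two contracts.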
Rolling correlation and linear regression
In [103]:
aapl = web.get_data_yahoo('AAPL', '2000-01-01')['Adj Close']
msft = web.get_data_yahoo('MSFT', '2000-01-01')['Adj Close']
aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()
In [104]:
plt.figure()
Out[104]:
In [105]:
aapl_rets.rolling(250).corr(msft_rets).plot()
Out[105]:
In [106]:
plt.figure()
Out[106]:
In [107]:
# pd.ols with a window argument (older pandas) fits rolling least squares regressions
model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets}, window=250)
model.beta
Out[107]:
In [108]:
model.beta['MSFT'].plot()
Out[108]:
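The rolling correlation is easy to reproduce on synthetic returns with the modern `.rolling(...).corr(...)` API; the 0.5 loading below is a made-up ground truth, so the estimated correlation should hover around 0.45:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range('2020-01-01', periods=500, freq='B')
x = pd.Series(rng.normal(0, 0.01, 500), index=idx)
y = 0.5 * x + pd.Series(rng.normal(0, 0.01, 500), index=idx)

# Modern spelling of pd.rolling_corr(x, y, 250).
rc = x.rolling(250).corr(y)
print(rc.dropna().tail())
```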
In [ ]:
