[Machine Learning] Predicting the IPO Market with Logistic Regression

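1 The IPO Market

Before building a model, let's discuss what an IPO (initial public offering) is and what research tells us about this market. After that, we can define a few strategies to apply.

1.1 What Is an IPO

An initial public offering is the process by which a private company becomes a public one. The offering raises capital for the company and gives the public a chance to invest in it by buying its shares. The details vary, but in a typical offering the company enlists one or more investment banks to underwrite the issue. This means the banks guarantee the company that they will buy all the shares offered at the IPO price on the day of the offering. The underwriters, of course, do not intend to keep all of those shares. With the help of the issuing company, they go on a so-called roadshow to drum up interest from institutional clients. These clients can subscribe for shares, indicating their interest in buying on the day of the IPO. This is a non-binding commitment, because the offer price is not finalized until the day of the IPO. The underwriters then set the offer price based on the level of interest expressed.
The interesting part is that research shows IPOs have been systematically underpriced. There are many theories about why this happens and why the degree of underpricing varies over time, but studies have shown that billions of dollars are "left on the table" every year. In an IPO, "money left on the table" is the difference between the offer price and the first-day closing price.
The difference between the offer price and the opening price also deserves a mention. Occasionally you can get an IPO at its offer price through your broker, but as an ordinary member of the public you will almost always have to buy at the (usually higher) opening price. We build our models under this assumption.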
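To make "money left on the table" concrete, here is a minimal arithmetic sketch. All three input figures are hypothetical and do not come from the dataset; multiplying the per-share gap by the number of shares offered is the usual way the aggregate figure is quoted, and is an assumption added here.

```python
# Hypothetical figures for a single IPO; none of these come from the dataset.
offer_price = 20.00          # price paid by investors who received an allocation
first_day_close = 23.50      # closing price on the first trading day
shares_offered = 10_000_000  # assumed size of the offering

per_share_gap = first_day_close - offer_price
money_left_on_table = per_share_gap * shares_offered
underpricing_pct = per_share_gap / offer_price * 100

print(f"Money left on the table: ${money_left_on_table:,.0f}")
print(f"First-day return from the offer price: {underpricing_pct:.1f}%")
```

The percentage computed here corresponds to the first-day change measured from the offer price, which is the quantity explored first in the analysis.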
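The data comes from IPOScoop.com, a service that provides ratings for upcoming IPOs. Visit https://www.iposcoop.com/scoop-track-record-from-2000-to-present and click the button at the bottom of the page to download a spreadsheet. The code below reads it from data/ipo_data.csv.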

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from patsy import dmatrix 
from sklearn.ensemble import RandomForestClassifier 
from sklearn import linear_model 
%matplotlib inline 
ipos = pd.read_csv(r'data/ipo_data.csv', encoding='latin-1') 
ipos.head()

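Each IPO comes with a fair amount of information: the offer date, the issuer, the offer price, the opening price, and the price changes. We start by exploring performance by year.
Some cleanup is needed first so that every column is formatted correctly. Here we strip the dollar and percent signs. This kind of step is unavoidable in machine learning: data preparation comes before any model training, and the quality of the data largely determines the quality of the model. Data preparation is often said to make up 80% to 90% of the work in a machine learning project.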

ipos = ipos.applymap(lambda x: x if not '($' in str(x) else x.replace('($','-'))
ipos = ipos.applymap(lambda x: x if not ')' in str(x) else x.replace(')',''))
ipos = ipos.applymap(lambda x: x if not '$' in str(x) else x.replace('$',''))
ipos = ipos.applymap(lambda x: x if not '%' in str(x) else x.replace('%',''))
ipos
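The four replacement passes above can also be expressed as one function applied to a single value. The sketch below is an alternative formulation, not the code used in this post; the name clean_money is hypothetical. It treats parentheses as a negative sign, strips the currency and percent symbols, and leaves anything it cannot parse (such as 'N/C') untouched.

```python
def clean_money(value):
    """Convert strings such as '$14.00', '($0.51)' or '3.67%' to floats.

    Non-string values and strings that are not numeric are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    text = text.strip("()").replace("$", "").replace("%", "").replace(",", "")
    try:
        number = float(text)
    except ValueError:
        return value  # e.g. 'N/C' or an issuer name
    return -number if negative else number


for raw in ["$14.00", "($0.51)", "3.67%", "N/C"]:
    print(repr(raw), "->", repr(clean_money(raw)))
```

Unlike the string replacements, this version returns floats directly, so it would be applied only to the price and percentage columns rather than to the whole DataFrame.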

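Next, fix the data types of all the columns. They are currently all objects, but the aggregations and other operations ahead need numeric types. The following line shows the column data types.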

ipos.info()

The data contains some 'N/C' values that must be replaced first. After that, the data types can be changed.

ipos.replace('N/C', 0, inplace=True)
ipos['Date'] = pd.to_datetime(ipos['Date'])
ipos['Offer Price'] = ipos['Offer Price'].astype('float')
ipos['Opening Price'] = ipos['Opening Price'].astype('float')
ipos['1st Day Close'] = ipos['1st Day Close'].astype('float')
ipos['1st Day % Px Chng '] = ipos['1st Day % Px Chng '].astype('float')
ipos['$ Change Close'] = ipos['$ Change Close'].astype('float')
ipos['$ Change Opening'] = ipos['$ Change Opening'].astype('float')
ipos['Star Ratings'] = ipos['Star Ratings'].astype('int')
ipos.info()

We start with the mean first-day percentage gain.

ipos.groupby(ipos['Date'].dt.year)['1st Day % Px Chng '].mean().plot(
    kind='bar',
    figsize=(15,10), color='k', 
     title='1st Day Mean IPO Percentage Change')

The percentages have been solidly positive in recent years. Now let's see how the median compares with the mean.

ipos.groupby(ipos['Date'].dt.year)['1st Day % Px Chng '].median().plot(
    kind='bar', figsize=(15,10), color='k', title='1st Day Median IPO Percentage Change')

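Comparing the mean with the median shows that a few large outliers skew the distribution of returns. The summary statistics from ipos['1st Day % Px Chng '].describe() confirm this.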

count    2692.000000
mean       13.345962
std        28.058242
min       -35.220000
25%         0.000000
50%         4.615000
75%        19.050000
max       353.850000
Name: 1st Day % Px Chng , dtype: float64
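The gap between the mean (13.35) and the median (4.62) in these statistics is the typical signature of a right-skewed distribution. A minimal sketch with made-up returns shows how a single outlier produces the same effect; the values are hypothetical and only illustrate the mechanism.

```python
from statistics import mean, median

# Hypothetical first-day returns in percent: seven ordinary results and one "home run".
returns = [-5.0, 0.0, 0.0, 2.0, 4.0, 6.0, 10.0, 350.0]

avg = mean(returns)
mid = median(returns)

print(f"mean   = {avg:.3f}")
print(f"median = {mid:.3f}")
```

The single large value drags the mean far above the median, while the median stays close to what a typical offering returned.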

Now plot it as a histogram.

ipos['1st Day % Px Chng '].hist(figsize=(15,7), bins=100, color='grey')

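We can see that most returns cluster around zero, but a long tail stretches out to the right, where there are some true home-run offerings.
So far we have looked at the first-day percentage change, that is, the move from the offer price to the first-day close. As noted earlier, though, there is little chance of buying at the offer price. So let's now look at the return from the opening price to the close. This helps answer the question: do all the gains go to those who get the offer price, or is there still a chance to jump in on the first day and earn outsized returns?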

To answer this question, first create two new columns.

ipos['$ Chg Open to Close'] = ipos['$ Change Close'] - ipos['$ Change Opening'] 
ipos['% Chg Open to Close'] = (ipos['$ Chg Open to Close']/ipos['Opening Price']) * 100
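As a quick check of the two formulas, here is the same arithmetic for one row. The prices are those of Fidus Investment (FDUS) as they appear in the table near the end of this post; the variable names are only for illustration.

```python
# Prices for Fidus Investment (FDUS), copied from the table later in this post.
offer_price = 15.00
opening_price = 14.75
first_day_close = 15.00

dollar_change_opening = opening_price - offer_price   # '$ Change Opening'
dollar_change_close = first_day_close - offer_price   # '$ Change Close'

# Subtracting the two offer-based changes cancels the offer price,
# leaving the move from the open to the close.
chg_open_to_close = dollar_change_close - dollar_change_opening
pct_open_to_close = chg_open_to_close / opening_price * 100

print(chg_open_to_close, round(pct_open_to_close, 6))
```

The result matches the '$ Chg Open to Close' and '% Chg Open to Close' values shown for that row.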

Now generate the summary statistics.

ipos['% Chg Open to Close'].describe()

Output:

count    2692.000000
mean        1.410962
std        10.847719
min       -49.281222
25%        -2.821365
50%         0.000000
75%         4.111604
max       159.417476
Name: % Chg Open to Close, dtype: float64

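These numbers look suspicious. An IPO can certainly fall after it opens, but a drop of nearly 99% seems unrealistic. On inspection, the two worst performers turn out to be bad data points, which are corrected below.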

ipos.loc[440, '$ Change Opening'] = .09 
ipos.loc[1264, '$ Change Opening'] = .01 
ipos.loc[1264, 'Opening Price'] = 11.26 
ipos['$ Chg Open to Close'] = ipos['$ Change Close'] - ipos['$ Change Opening']  
ipos['% Chg Open to Close'] = (ipos['$ Chg Open to Close']/ipos['Opening Price']) * 100 
ipos['% Chg Open to Close'].describe()

count    2692.000000
mean        1.422169
std        10.858504
min       -49.281222
25%        -2.821365
50%         0.000000
75%         4.119485
max       159.417476
Name: % Chg Open to Close, dtype: float64

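This time the worst loss drops to 40%. That still looks suspicious, but a closer look shows it is the Zillow IPO: Zillow opened extremely hot and then sank to the floor well before the close.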

ipos['% Chg Open to Close'].hist(figsize=(15,7), bins=100, color='grey')

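The distribution of open-to-close changes looks markedly different from that of offer-to-close changes. Both the mean and the median are significantly lower, and while the bars just right of the origin show a healthy gradient, the bars left of the origin appear to have grown in proportion.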

1.2 A Basic IPO Strategy

If we buy every IPO at its opening price and sell at the close, what is the final return?

ipos[ipos['Date']>='2015-01-01']['$ Chg Open to Close'].describe()

count    173.000000
mean       0.272543
std        2.652583
min       -6.160000
25%       -0.720000
50%        0.000000
75%        0.700000
max       20.040000
Name: $ Chg Open to Close, dtype: float64

ipos[ipos['Date']>='2015-01-01']['$ Chg Open to Close'].sum()# 47.15

Now split the winning trades from the losing trades.

ipos[(ipos['Date']>='2015-01-01')&(ipos['$ Chg Open to Close']>0)]['$ Chg Open to Close'].describe()

count    86.000000
mean      1.646512
std       2.934752
min       0.010000
25%       0.260000
50%       0.700000
75%       1.650000
max      20.040000
Name: $ Chg Open to Close, dtype: float64

ipos[(ipos['Date']>='2015-01-01')&(ipos['$ Chg Open to Close']<0)]['$ Chg Open to Close'].describe()

count    78.000000
mean     -1.210897
std       1.365035
min      -6.160000
25%      -1.470000
50%      -0.840000
75%      -0.212500
max      -0.010000
Name: $ Chg Open to Close, dtype: float64

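So had we invested in every IPO in 2015, we would have been busy with 173 IPOs, of which 86 made money and 78 lost money. Overall the strategy is still profitable, because the gains on the winning IPOs more than make up for the losses: the total is $47.15 per share bought, a mean of about $0.27. This assumes no slippage or commission costs, which are unavoidable in the real world. Even so, this is clearly no recipe for riches, because the mean return is below 1%.
Can machine learning help improve on this basic approach?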

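2 Feature Engineering

What affects a stock's performance once trading starts? The recent performance of the overall market or the prestige of the underwriter may matter. Perhaps the day of the week or the month of the offering matters. Considering these factors and including them in the model is called feature engineering, and modeling the features is almost as important as the data used to build the model. If your features carry no information, the model will have no value.
We start with data for the S&P 500 index, arguably the best proxy for the broad U.S. market. It can be downloaded from Yahoo! Finance at https://finance.yahoo.com/q/hp?s=%5EGSPC&a=00&b=1&c=2000&d=11&e=17&f=2015&g=d and then imported with pandas.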

sp = pd.read_csv(r'data/spy.csv') 
sp.sort_values('Date', inplace=True) 
sp.reset_index(drop=True, inplace=True) 
sp

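Because the overall market's performance over the past week could plausibly affect a stock, we add it to the DataFrame. We compute the percentage change between the S&P 500's close on the previous trading day and its close seven trading days before that.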

def get_week_chg(ipo_dt): 
    try: 
        day_ago_idx = sp[sp['Date']==str(ipo_dt.date())].index[0] - 1 
        week_ago_idx = sp[sp['Date']==str(ipo_dt.date())].index[0] - 8 
        chg = (sp.iloc[day_ago_idx]['Close'] - sp.iloc[week_ago_idx]['Close'])/(sp.iloc[week_ago_idx]['Close']) 
        return chg * 100 
    except: 
        print('error', ipo_dt.date()) 
ipos['SP Week Change'] = ipos['Date'].map(get_week_chg)

Output:
error 2015-02-21
error 2015-02-21
error 2013-11-16
error 2009-08-01
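Running the code reports failures for a few dates, which suggests that some IPO dates are wrong. Checking the IPOs tied to these dates shows that the market was closed on those days, so the dates are corrected below before the column is recomputed.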

ipos.loc[1155, 'Date'] = pd.to_datetime('2009-08-12')
ipos.loc[656, 'Date'] = pd.to_datetime('2012-11-20')
ipos.loc[27, 'Date'] = pd.to_datetime('2015-05-21')
ipos.loc[28, 'Date'] = pd.to_datetime('2015-05-21')       
ipos['SP Week Change'] = ipos['Date'].map(get_week_chg)
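The row-by-row lookup in get_week_chg can also be written with shifted columns. The sketch below is an alternative under stated assumptions, not the code used in this post: it builds a small hypothetical price table (sp_demo), indexes it by date, and uses Series.shift to line up yesterday's close with the close eight rows back, matching the "- 1" and "- 8" offsets above.

```python
import pandas as pd

# Hypothetical closing prices for ten consecutive business days.
sp_demo = pd.DataFrame({
    "Date": pd.bdate_range("2015-01-05", periods=10),
    "Close": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0],
}).set_index("Date")

prev_close = sp_demo["Close"].shift(1)      # close on the previous trading day
week_ago_close = sp_demo["Close"].shift(8)  # close eight rows before the IPO date
week_chg = (prev_close - week_ago_close) / week_ago_close * 100

# Look up two IPO dates: a trading day and a Saturday.
ipo_dates = pd.to_datetime(["2015-01-16", "2015-01-17"])
looked_up = week_chg.reindex(ipo_dates)
print(looked_up)
```

Dates that are not in the price table, such as weekends, come back as NaN instead of raising an error, so they can be found afterwards with isna(). Dates too close to the start of the price history also yield NaN, whereas a negative position passed to iloc would silently count from the end of the table.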

Next, add a new metric: the percentage change in the S&P 500 from the previous day's close to the open on the day of the IPO.

def get_cto_chg(ipo_dt): 
    try: 
        today_open_idx = sp[sp['Date']==str(ipo_dt.date())].index[0] 
        yday_close_idx = sp[sp['Date']==str(ipo_dt.date())].index[0] - 1 
        chg = (sp.iloc[today_open_idx]['Open'] -sp.iloc[yday_close_idx]['Close'])/(sp.iloc[yday_close_idx]['Close']) 
        return chg * 100 
    except: 
        print('error', ipo_dt) 
ipos['SP Close to Open Chg Pct'] = ipos['Date'].map(get_cto_chg)

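Now tidy up the underwriter data. This takes some work and a series of steps.
First, add a column for the lead underwriter. Next, normalize the data. Finally, add a column with the total number of participating underwriters.
To begin, parse out the lead underwriter by splitting the string and stripping whitespace.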

ipos['Lead Mgr'] = ipos['Lead/Joint-Lead Managers'].map(lambda x: x.split('/')[0])
ipos['Lead Mgr'] = ipos['Lead Mgr'].map(lambda x: x.strip())

Print the distinct lead underwriters to see how much cleanup is needed to normalize the bank names.

unique_lead_mgr = pd.DataFrame(ipos['Lead Mgr'].unique(), columns=['Name'])

# Sort the unique values alphabetically
sorted_lead_mgr = unique_lead_mgr.sort_values(by='Name')

# Iterate over the sorted unique values and print each one
for n in sorted_lead_mgr['Name']:
    print(n)

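There are two ways to do this. The first, clearly the easier of the two, is to copy and paste the code below, which contains the work already done. The other is to run many rounds of partial string matching and correct the names yourself. The first option is strongly recommended.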

ipos.loc[ipos['Lead Mgr'].str.contains('Hambrecht'),'Lead Mgr'] = 'WR Hambrecht+Co.'
ipos.loc[ipos['Lead Mgr'].str.contains('Edwards'), 'Lead Mgr'] = 'AG Edwards'
ipos.loc[ipos['Lead Mgr'].str.contains('Edwrads'), 'Lead Mgr'] = 'AG Edwards'
ipos.loc[ipos['Lead Mgr'].str.contains('Barclay'), 'Lead Mgr'] = 'Barclays'
ipos.loc[ipos['Lead Mgr'].str.contains('Aegis'), 'Lead Mgr'] = 'Aegis Capital'
ipos.loc[ipos['Lead Mgr'].str.contains('Deutsche'), 'Lead Mgr'] = 'Deutsche Bank'
ipos.loc[ipos['Lead Mgr'].str.contains('Suisse'), 'Lead Mgr'] = 'CSFB'
ipos.loc[ipos['Lead Mgr'].str.contains('CS.?F'), 'Lead Mgr'] = 'CSFB'
ipos.loc[ipos['Lead Mgr'].str.contains('^Early'), 'Lead Mgr'] = 'EarlyBirdCapital'
ipos.loc[325,'Lead Mgr'] = 'Maximum Captial'
ipos.loc[ipos['Lead Mgr'].str.contains('Keefe'), 'Lead Mgr'] = 'Keefe, Bruyette & Woods'
ipos.loc[ipos['Lead Mgr'].str.contains('Stan'), 'Lead Mgr'] = 'Morgan Stanley'
ipos.loc[ipos['Lead Mgr'].str.contains('P. Morg'), 'Lead Mgr'] = 'JP Morgan'
ipos.loc[ipos['Lead Mgr'].str.contains('PM'), 'Lead Mgr'] = 'JP Morgan'
ipos.loc[ipos['Lead Mgr'].str.contains(r'J\.P\.'), 'Lead Mgr'] = 'JP Morgan'
ipos.loc[ipos['Lead Mgr'].str.contains('Banc of'), 'Lead Mgr'] = 'Banc of America'
ipos.loc[ipos['Lead Mgr'].str.contains('Lych'), 'Lead Mgr'] = 'BofA Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('Merrill$'), 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('Lymch'), 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('A Merril Lynch'), 'Lead Mgr'] = 'BofA Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('Merril '), 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('BofA$'), 'Lead Mgr'] = 'BofA Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('SANDLER'), 'Lead Mgr'] = 'Sandler O\'neil + Partners'
ipos.loc[ipos['Lead Mgr'].str.contains('Sandler'), 'Lead Mgr'] = 'Sandler O\'Neil + Partners'
ipos.loc[ipos['Lead Mgr'].str.contains('Renshaw'), 'Lead Mgr'] = 'Rodman & Renshaw'
ipos.loc[ipos['Lead Mgr'].str.contains('Baird'), 'Lead Mgr'] = 'RW Baird'
ipos.loc[ipos['Lead Mgr'].str.contains('Cantor'), 'Lead Mgr'] = 'Cantor Fitzgerald'
ipos.loc[ipos['Lead Mgr'].str.contains('Goldman'), 'Lead Mgr'] = 'Goldman Sachs'
ipos.loc[ipos['Lead Mgr'].str.contains('Bear'), 'Lead Mgr'] = 'Bear Stearns'
ipos.loc[ipos['Lead Mgr'].str.contains('BoA'), 'Lead Mgr'] = 'BofA Merrill Lynch'
ipos.loc[ipos['Lead Mgr'].str.contains('Broadband'), 'Lead Mgr'] = 'Broadband Capital'
ipos.loc[ipos['Lead Mgr'].str.contains('Davidson'), 'Lead Mgr'] = 'DA Davidson'
ipos.loc[ipos['Lead Mgr'].str.contains('Feltl'), 'Lead Mgr'] = 'Feltl & Co.'
ipos.loc[ipos['Lead Mgr'].str.contains('China'), 'Lead Mgr'] = 'China International'
ipos.loc[ipos['Lead Mgr'].str.contains('Cit'), 'Lead Mgr'] = 'Citigroup'
ipos.loc[ipos['Lead Mgr'].str.contains('Ferris'), 'Lead Mgr'] = 'Ferris Baker Watts'
ipos.loc[ipos['Lead Mgr'].str.contains('Friedman|Freidman|FBR'), 'Lead Mgr'] = 'Friedman Billings Ramsey'
ipos.loc[ipos['Lead Mgr'].str.contains('^I-'), 'Lead Mgr'] = 'I-Bankers'
ipos.loc[ipos['Lead Mgr'].str.contains('Gunn'), 'Lead Mgr'] = 'Gunn Allen'
ipos.loc[ipos['Lead Mgr'].str.contains('Jeffer'), 'Lead Mgr'] = 'Jefferies'
ipos.loc[ipos['Lead Mgr'].str.contains('Oppen'), 'Lead Mgr'] = 'Oppenheimer'
ipos.loc[ipos['Lead Mgr'].str.contains('JMP'), 'Lead Mgr'] = 'JMP Securities'
ipos.loc[ipos['Lead Mgr'].str.contains('Rice'), 'Lead Mgr'] = 'Johnson Rice'
ipos.loc[ipos['Lead Mgr'].str.contains('Ladenburg'), 'Lead Mgr'] = 'Ladenburg Thalmann'
ipos.loc[ipos['Lead Mgr'].str.contains('Piper'), 'Lead Mgr'] = 'Piper Jaffray'
ipos.loc[ipos['Lead Mgr'].str.contains('Pali'), 'Lead Mgr'] = 'Pali Capital'
ipos.loc[ipos['Lead Mgr'].str.contains('Paulson'), 'Lead Mgr'] = 'Paulson Investment Co.'
ipos.loc[ipos['Lead Mgr'].str.contains('Roth'), 'Lead Mgr'] = 'Roth Capital'
ipos.loc[ipos['Lead Mgr'].str.contains('Stifel'), 'Lead Mgr'] = 'Stifel Nicolaus'
ipos.loc[ipos['Lead Mgr'].str.contains('SunTrust'), 'Lead Mgr'] = 'SunTrust Robinson'
ipos.loc[ipos['Lead Mgr'].str.contains('Wachovia'), 'Lead Mgr'] = 'Wachovia'
ipos.loc[ipos['Lead Mgr'].str.contains('Wedbush'), 'Lead Mgr'] = 'Wedbush Morgan'
ipos.loc[ipos['Lead Mgr'].str.contains('Blair'), 'Lead Mgr'] = 'William Blair'
ipos.loc[ipos['Lead Mgr'].str.contains('Wunderlich'), 'Lead Mgr'] = 'Wunderlich'
ipos.loc[ipos['Lead Mgr'].str.contains('Max'), 'Lead Mgr'] = 'Maxim Group'
ipos.loc[ipos['Lead Mgr'].str.contains('CIBC'), 'Lead Mgr'] = 'CIBC'
ipos.loc[ipos['Lead Mgr'].str.contains('CRT'), 'Lead Mgr'] = 'CRT Capital'
ipos.loc[ipos['Lead Mgr'].str.contains('HCF'),'Lead Mgr'] = 'HCFP Brenner'
ipos.loc[ipos['Lead Mgr'].str.contains('Cohen'), 'Lead Mgr']  = 'Cohen & Co.'
ipos.loc[ipos['Lead Mgr'].str.contains('Cowen'), 'Lead Mgr'] = 'Cowen & Co.'
ipos.loc[ipos['Lead Mgr'].str.contains('Leerink'), 'Lead Mgr']  = 'Leerink Partners'
ipos.loc[ipos['Lead Mgr'].str.contains('Lynch\xca'), 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[1210, 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[1230, 'Lead Mgr'] = 'Merrill Lynch'
ipos.loc[241, 'Lead Mgr'] = 'MDB Capital Group LLC'
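The long list of assignments follows one pattern: match a regular expression, then overwrite the name. A possible way to keep such rules maintainable is a table of (pattern, canonical name) pairs applied in a loop. The sketch below uses three rules taken from the list above and a small hypothetical Series in place of ipos['Lead Mgr'].

```python
import pandas as pd

# A few illustrative rules; patterns and canonical names mirror lines from the list above.
rules = [
    (r"Goldman", "Goldman Sachs"),
    (r"Stan", "Morgan Stanley"),
    (r"J\.P\.", "JP Morgan"),
]

# Hypothetical raw names standing in for ipos['Lead Mgr'].
lead_mgr = pd.Series(["Goldman, Sachs & Co.", "Morgan Stanley & Co.", "J.P. Morgan", "Jefferies"])

# Apply the rules in order; later rules see the result of earlier ones.
for pattern, canonical in rules:
    mask = lead_mgr.str.contains(pattern, regex=True)
    lead_mgr.loc[mask] = canonical

print(lead_mgr.tolist())
```

As in the original list, the order of the rules matters, because each rule runs against names that earlier rules may already have rewritten.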

With that done, add the number of underwriters.

ipos['Total Underwriters'] = ipos['Lead/Joint-Lead Managers'].map(lambda x: len(x.split('/')))

Next, add a few date-related features: the day of the week and the month.

ipos['Week Day'] = ipos['Date'].dt.dayofweek.map({0:'Mon', 1:'Tues', 2:'Wed', 3:'Thurs', 4:'Fri', 5:'Sat', 6:'Sun'})
ipos['Month'] = ipos['Date'].map(lambda x: x.month)
ipos['Month'] = ipos['Month'].map({1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
                                   7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'})

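Now add a couple of final features: the change from the offer price to the opening price, and the change from the opening price to the close, both as a percentage of the opening price.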

ipos['Gap Open Pct'] = (ipos['$ Change Opening'].astype('float')/ipos['Opening Price'].astype('float')) * 100
ipos['Open to Clost Pct'] = (ipos['$ Change Close'].astype('float') - ipos['$ Change Opening'].astype('float'))/ipos['Opening Price'].astype('float') * 100

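Before feeding these features to the model, we need to think about which ones to include. Be careful not to "leak" information when adding features. This is a common mistake: leakage happens when the model is given information that would not have been available at the time of the prediction. For example, adding the closing price to the model would completely invalidate the results, because we would be handing the model the very answer it is trying to predict.
We will use the following features.

Month of the offering (Month).
Day of the week (Week Day).
Lead underwriter (Lead Mgr).
Total number of underwriters (Total Underwriters).
Percentage gap from the offer price to the opening price (Gap Open Pct).
Dollar change from the offer price to the opening price ($ Change Opening).
Offer price (Offer Price).
Opening price (Opening Price).
Percentage change in the S&P 500 from close to open (SP Close to Open Chg Pct).
Change in the S&P 500 over the previous week (SP Week Change).
With all the features in place, we prepare them for the model using the Patsy library. Patsy can be installed with pip if needed. It takes data in its raw form and converts it into a matrix suitable for building statistical models.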

from patsy import dmatrix

# Build the design matrix with dmatrix; it contains the specified features (the target is built separately)
X = dmatrix('Month + Q("Week Day") + Q("Total Underwriters") + \
            Q("Gap Open Pct") + Q("$ Change Opening") + Q("Lead Mgr") + Q("Offer Price") + \
            Q("Opening Price") + Q("SP Close to Open Chg Pct") + Q("SP Week Change")',
            data=ipos,
            return_type='dataframe')
# Display the design matrix X
X

	Intercept	Month[T.Aug]	Month[T.Dec]	Month[T.Feb]	Month[T.Jan]	Month[T.Jul]	Month[T.Jun]	Month[T.Mar]	Month[T.May]	Month[T.Nov]	...	Q("Lead Mgr")[T.WestPark Capital]	Q("Lead Mgr")[T.William Blair]	Q("Lead Mgr")[T.Wunderlich]	Q("Total Underwriters")	Q("Gap Open Pct")	Q("$ Change Opening")	Q("Offer Price")	Q("Opening Price")	Q("SP Close to Open Chg Pct")	Q("SP Week Change")
0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	3.0	-3.780578	-0.51	14.00	13.49	-0.021079	-0.496349
1	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	2.0	3.669725	0.60	15.75	16.35	-0.021079	-0.496349
2	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	2.0	0.099900	0.01	10.00	10.01	-0.021079	-0.496349
3	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	3.0	30.693069	6.20	14.00	20.20	-0.008236	1.720188
4	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	2.0	9.539474	1.16	11.00	12.16	-0.448697	2.278166
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2687	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	1.0	27.073838	5.94	16.00	21.94	0.000000	0.558352
2688	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	2.0	9.338169	2.06	20.00	22.06	0.000000	0.558352
2689	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	1.0	10.916667	1.31	10.69	12.00	0.000000	2.083563
2690	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	1.0	10.037879	1.06	9.50	10.56	0.000000	4.962166
2691	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	1.0	11.958914	1.63	12.00	13.63	0.000000	-2.586920

2692 rows × 155 columns

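We can see that Patsy has expanded the categorical data into multiple columns while keeping each continuous variable in a single column. This is called dummy coding. In this format each month gets its own column, and the same goes for each underwriter. For example, if a particular IPO (a row) was issued in May, its value in the May column is 1 and its value in every other month column is 0. A categorical feature with n levels always produces n-1 columns. The excluded level becomes the baseline against which the others are compared. Finally, Patsy also adds an intercept column, which regression models need in order to work properly.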
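The n-1 column rule is easy to see on a tiny example. The sketch below uses pandas.get_dummies with drop_first=True rather than Patsy, so the column names differ from the design matrix above, but the encoding idea is the same. The four-row table is hypothetical.

```python
import pandas as pd

# Hypothetical IPO months; three distinct levels: Apr, Jun, May.
demo = pd.DataFrame({"Month": ["Apr", "May", "Jun", "May"]})

# drop_first=True removes the first level in sorted order (Apr), which becomes the baseline.
dummies = pd.get_dummies(demo["Month"], prefix="Month", drop_first=True, dtype=float)
print(dummies)
```

Three levels yield two columns. The April row is all zeros because April is the baseline, which is also why the design matrix above has no Month[T.Apr] column.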

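3 Binary Classification

Rather than trying to predict the exact first-day return, we try to predict whether an IPO is worth buying. Note that nothing here is investment advice; the purpose is only to illustrate a case study. Please do not start trading IPOs on the basis of this model. The consequences could be severe. To predict a binary outcome (1 or 0, yes or no), we start with a model called logistic regression. Logistic regression uses the logistic function, which is a good fit here because it has several mathematical properties that make it easy to work with.
Because of its form, the logistic function is particularly well suited to producing probability estimates and binary responses based on them. Anything above 0.5 is classified as 1, and anything below 0.5 as 0. These 1s and 0s can stand for anything we want to classify; in this application they decide whether to buy (1) or not buy (0) an offering.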
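A minimal sketch of the logistic function and the 0.5 decision rule, written in plain Python. The function names logistic and classify are illustrative and are not part of scikit-learn; the input z stands for the linear combination of features and coefficients.

```python
import math


def logistic(z):
    """Map a real-valued score z (the log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 (buy) when the estimated probability exceeds the threshold, else 0."""
    return 1 if logistic(z) > threshold else 0


for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 3), classify(z))
```

A score of zero maps to a probability of exactly 0.5, so the sign of the score decides the class under the default threshold.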
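Standard practice for machine learning models is to decide at random which instances serve as training data and which as test data. Because our data is time-based, however, we train on all data except the current year (2015) and then test on the 2015 year-to-date data.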

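# 2188 is taken to be the index of the first 2015 row; this assumes the rows
# are sorted by date in ascending order, so check the ordering before relying on it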
X_train, X_test = X[:2188], X[2188:] 
y_train = ipos['$ Chg Open to Close'][:2188].map(lambda x: 1 if x >= 1 else 0) 
y_test = ipos['$ Chg Open to Close'][2188:].map(lambda x: 1 if x >= 1 else 0)
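A positional split such as X[:2188] depends entirely on the row order of the data. A date-based mask is a more defensive way to express the same intent. The sketch below is an alternative under stated assumptions, not the split used for the results in this post; the five-row table is hypothetical and deliberately stored newest first.

```python
import pandas as pd

# Hypothetical miniature dataset, stored newest first.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-03-02", "2015-01-15", "2014-11-20", "2014-06-05", "2013-09-12"]),
    "$ Chg Open to Close": [1.50, -0.20, 0.30, 2.10, -1.00],
})

cutoff = pd.Timestamp("2015-01-01")
is_test = demo["Date"] >= cutoff  # True for rows on or after the cutoff

# Label: 1 when the open-to-close gain is at least $1, else 0.
y = (demo["$ Chg Open to Close"] >= 1).astype(int)
y_train, y_test = y[~is_test], y[is_test]

print("train:", y_train.tolist(), "test:", y_test.tolist())
```

Because the mask is built from the dates themselves, the split stays correct whether the rows are sorted ascending, descending, or not at all. The same mask could select rows of the design matrix, since the matrix shares the DataFrame's row index.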

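This splits the data into training and test sets. Note that the $1 threshold for a positive outcome is a subjective choice. It reflects the strategy of targeting winners in the long tail rather than any close above the open.
Fit the model as follows.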

clf = linear_model.LogisticRegression() 
clf.fit(X_train, y_train)

Now evaluate the model on the held-out 2015 data.

clf.score(X_test, y_test)# 0.7261904761904762

Baseline strategy

ipos[(ipos['Date']>='2015-01-01')]['$ Chg Open to Close'].describe()

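For 2015 this accuracy looks better than the baseline strategy, but it can be misleading, because the share of IPOs that actually gain more than $1 is quite low. That means predicting 0 for every IPO could achieve a similar score. Still, let's compare the two.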
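The point about always predicting 0 can be checked with a few lines of plain Python. The labels below are hypothetical and chosen so that positives are rare, as they are in the IPO data.

```python
# Hypothetical test labels: 1 marks an IPO that gained at least $1 from open to close.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# A "model" that never buys anything.
always_zero = [0] * len(y_true)

correct = sum(pred == actual for pred, actual in zip(always_zero, y_true))
baseline_accuracy = correct / len(y_true)

print(f"accuracy of always predicting 0: {baseline_accuracy:.2f}")
```

With two positives out of ten, the do-nothing predictor is already 80% accurate, which is why accuracy alone says little here and the returns of the predicted buys are examined next.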
Next, process the predictions. First build a DataFrame from the results, then display it.

pred_label = clf.predict(X_test) 
results=[] 
for pl, tl, idx, chg in zip(pred_label, y_test, y_test.index, ipos.iloc[y_test.index]['$ Chg Open to Close']): 
    if pl == tl: 
        results.append([idx, chg, pl, tl, 1]) 
    else: 
        results.append([idx, chg, pl, tl, 0]) 
rf = pd.DataFrame(results, columns=['index', '$ chg', 'predicted', 'actual', 'correct']) 
rf

	index	$ chg	predicted	actual	correct
0	2188	0.35	0	0	1
1	2189	-0.15	0	0	1
2	2190	0.72	0	0	1
3	2191	2.30	0	1	0
4	2192	-0.10	0	0	1
...	...	...	...	...	...
499	2687	1.37	1	1	1
500	2688	2.44	0	1	0
501	2689	1.38	0	1	0
502	2690	-0.68	0	0	1
503	2691	2.37	0	1	0

504 rows × 5 columns

rf[rf['predicted']==1]['$ chg'].describe()

count    37.000000
mean      2.116216
std       7.061475
min      -9.870000
25%      -1.870000
50%       0.500000
75%       4.770000
max      20.600000
Name: $ chg, dtype: float64

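So the model flags 37 of the 504 test IPOs as buys. Their mean open-to-close change is $2.12 and their median is $0.50, compared with a mean of $0.27 and a median of $0 for buying every 2015 IPO. Let's chart the returns.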

fig, ax = plt.subplots(figsize=(15,10)) 
rf[rf['predicted']==1]['$ chg'].plot(kind='bar') 
ax.set_title('Model Predicted Buys', y=1.01) 
ax.set_ylabel('$ Change Open to Close') 
ax.set_xlabel('Index')

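The chart suggests that the model caught one very large gain mid-year, along with several smaller gains and losses. That is not much of a test; we may simply have been lucky to catch that one win. We still need to assess how robust the model is. There are several ways to do so, but here we do just two things. First, lower the threshold from $1 to $0.25 and see how the model reacts.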

X_train, X_test = X[:2188], X[2188:] 
y_train = ipos['$ Chg Open to Close'][:2188].map(lambda x: 1 if x >= .25 else 0) 
y_test = ipos['$ Chg Open to Close'][2188:].map(lambda x: 1 if x >= .25 else 0) 
clf = linear_model.LogisticRegression() 
clf.fit(X_train, y_train) 
clf.score(X_test, y_test)# 0.5476190476190477

Now check the results.

pred_label = clf.predict(X_test) 
results=[] 
for pl, tl, idx, chg in zip(pred_label, y_test, y_test.index, ipos.iloc[y_test.index]['$ Chg Open to Close']): 
    if pl == tl:
        results.append([idx, chg, pl, tl, 1])
    else:
        results.append([idx, chg, pl, tl, 0])
rf = pd.DataFrame(results, columns=['index', '$ chg', 'predicted', 'actual', 'correct']) 
rf[rf['predicted']==1]['$ chg'].describe()

count    71.000000
mean      1.639437
std       5.581664
min      -9.870000
25%      -0.975000
50%       0.250000
75%       2.770000
max      20.600000
Name: $ chg, dtype: float64

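Both the accuracy and the mean fall. However, the number of buys rises from 37 to 71, and the mean of $1.64 is still well above the baseline strategy. Let's run one more test: remove the 2014 data from the training set and add it to the test set.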

X_train, X_test = X[:1900], X[1900:] 
y_train = ipos['$ Chg Open to Close'][:1900].map(lambda x: 1 if x >= .25 else 0) 
y_test = ipos['$ Chg Open to Close'][1900:].map(lambda x: 1 if x >= .25 else 0) 
clf = linear_model.LogisticRegression() 
clf.fit(X_train, y_train) 
clf.score(X_test, y_test)# 0.5896464646464646

Check again.

pred_label = clf.predict(X_test) 
results=[] 
for pl, tl, idx, chg in zip(pred_label, y_test, y_test.index, ipos.iloc[y_test.index]['$ Chg Open to Close']): 
    if pl == tl:
        results.append([idx, chg, pl, tl, 1])
    else:
        results.append([idx, chg, pl, tl, 0])
rf = pd.DataFrame(results, columns=['index', '$ chg', 'predicted', 'actual', 'correct'])
rf[rf['predicted']==1]['$ chg'].describe()

count    101.000000
mean       1.529901
std        5.159941
min       -9.870000
25%       -0.600000
50%        0.350000
75%        2.560000
max       20.600000
Name: $ chg, dtype: float64

With the 2014 data moved into the test set, the mean dips a little, but the model still outperforms the naive approach of investing in every IPO.

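4 Feature Importance

Which features increase the probability that an offering will succeed? Unfortunately, there is no simple answer to this question.
Because the model is a logistic regression, we can inspect the coefficient of each feature. Recall that the logistic function takes the following form.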

\[\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x \]

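Here $p$ is the probability of a positive outcome, $\beta_0$ is the intercept, and $\beta_1$ is the coefficient of the feature $x$. Once the model is fitted, we can inspect these coefficients. Let's retrieve them now.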

# Create a DataFrame with the feature names and their coefficients
fv = pd.DataFrame({'Feature': X_train.columns, 'Coef': clf.coef_.flatten()})
# Sort by coefficient in descending order
fv_sorted = fv.sort_values('Coef', ascending=False).reset_index(drop=True)
fv_sorted

	Feature	Coef
0	Q("Lead Mgr")[T.Merrill Lynch]	0.953996
1	Q("Lead Mgr")[T.BMO Capital Markets]	0.894796
2	Q("Lead Mgr")[T.Wachovia]	0.697398
3	Q("Lead Mgr")[T.Raymond James]	0.552723
4	Q("Lead Mgr")[T.Friedman Billings Ramsey]	0.538899
...	...	...
150	Q("Lead Mgr")[T.Lazard Capital Markets]	-0.551479
151	Intercept	-0.602364
152	Q("Week Day")[T.Mon]	-0.731993
153	Q("Lead Mgr")[T.Morgan Joseph]	-0.813587
154	Q("Lead Mgr")[T.EarlyBirdCapital]	-1.274693

155 rows × 2 columns

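For a categorical feature, a positive coefficient means that the presence of the feature raises the probability of a positive outcome relative to the baseline. For a continuous feature, a positive sign means that an increase in the feature's value raises the probability of a positive outcome. The magnitude of the coefficient indicates how strongly the log-odds, and with them the probability, change. Let's look at the day-of-week feature.

fv[fv['Feature'].str.contains('Week Day')]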

	Feature	Coef
12	Q("Week Day")[T.Mon]	-0.731993
13	Q("Week Day")[T.Thurs]	0.024071
14	Q("Week Day")[T.Tues]	-0.063116
15	Q("Week Day")[T.Wed]	-0.073945
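Since the model is linear in the log-odds, exponentiating a coefficient gives the factor by which the odds change relative to the baseline. The sketch below applies this to the four day-of-week coefficients shown in the table above; the dictionary is typed in by hand for illustration.

```python
import math

# Coefficients copied from the day-of-week table above; Friday is the baseline.
week_day_coefs = {"Mon": -0.731993, "Thurs": 0.024071, "Tues": -0.063116, "Wed": -0.073945}

# exp(coefficient) is the odds ratio relative to the baseline level.
odds_ratios = {day: math.exp(coef) for day, coef in week_day_coefs.items()}

for day, ratio in odds_ratios.items():
    print(f"{day}: odds multiplied by {ratio:.3f} relative to Fri")
```

A ratio below 1 means lower odds of a positive outcome than on a Friday, and a ratio above 1 means higher odds, holding the other features fixed.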

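Notice that Friday is missing. This means Friday is the baseline against which the other days are compared. According to this model, Thursday slightly increases the odds of a successful IPO relative to Friday. Back to feature importance: it is tempting to think that we can now take the features with the largest positive coefficients, throw them into a model, and own the new-issue market. Not so fast. Look at the IPOs behind two of the lead-underwriter features, Morgan Keegan and C.E. Unterberg, Towbin.

ipos[ipos['Lead Mgr'].str.contains('Keegan|Towbin')]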

	Date	Issuer	Symbol	Lead/Joint-Lead Managers	Offer Price	Opening Price	1st Day Close	1st Day % Px Chng	$ Change Opening	$ Change Close	...	$ Chg Open to Close	% Chg Open to Close	SP Week Change	SP Close to Open Chg Pct	Lead Mgr	Total Underwriters	Week Day	Month	Gap Open Pct	Open to Clost Pct
923	2011-06-22	Fidus Investment	FDUS	Morgan Keegan	15.0	14.75	15.00	0.00	-0.25	0.00	...	0.25	1.694915	1.930797	-0.003091	Morgan Keegan	1	Wed	Jun	-1.694915	1.694915
1275	2007-02-26	Rosetta Genomics	ROSG	C.E. Unterberg, Towbin	7.0	7.02	7.32	4.57	0.02	0.32	...	0.30	4.273504	0.479826	-0.010330	C.E. Unterberg, Towbin	1	Mon	Feb	0.284900	4.273504
1865	2005-08-04	Advanced Life Sciences	ADLS	C.E. Unterberg, Towbin/ThinkEquity Partners	5.0	5.03	6.00	20.00	0.03	1.00	...	0.97	19.284294	1.302654	0.000000	C.E. Unterberg, Towbin	2	Thurs	Aug	0.596421	19.284294
2312	2002-05-21	Computer Programs and Systems	CPSI	Morgan Keegan/Raymond James	16.5	17.50	18.12	9.82	1.00	1.62	...	0.62	3.542857	1.758604	0.000000	Morgan Keegan	2	Tues	May	5.714286	3.542857
2393	2001-05-23	Smith & Wollensky	SWRG	CE Unterberg Towbin	8.5	8.51	7.77	-8.59	0.01	-0.73	...	-0.74	-8.695652	5.114513	0.000000	CE Unterberg Towbin	1	Wed	May	0.117509	-8.695652
2451	2001-12-14	Northwest Biotherapeutics	NWBT	C.E. Unterberg, Towbin	5.0	5.10	5.31	6.20	0.10	0.31	...	0.21	4.117647	-2.220479	0.000000	C.E. Unterberg, Towbin	1	Fri	Dec	1.960784	4.117647
2583	2000-08-09	Millennium Cell	MCEL	Morgan Keegan	10.0	10.00	10.00	0.00	0.00	0.00	...	0.00	0.000000	4.430627	0.000000	Morgan Keegan	1	Wed	Aug	0.000000	0.000000
2613	2000-08-25	ServiceWare Technologies	SVCW	C.E. Unterberg, Towbin	7.0	8.50	8.75	25.00	1.50	1.75	...	0.25	2.941176	1.608699	0.000000	C.E. Unterberg, Towbin	1	Fri	Aug	17.647059	2.941176

8 rows × 22 columns

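Together these two underwriters account for only eight IPOs. This is why it is hard to extract information from a logistic regression model, especially one this complex. We turn to another model, the random forest classifier, to get a measure of importance.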

clf_rf = RandomForestClassifier(n_estimators=1000) 
clf_rf.fit(X_train, y_train) 
f_importances = clf_rf.feature_importances_ 
f_names = X_train.columns
f_std = np.std([tree.feature_importances_ for tree in clf_rf.estimators_], axis=0)
zz = zip(f_importances, f_names, f_std) 
zzs = sorted(zz, key=lambda x: x[0], reverse=True) 
imps = [x[0] for x in zzs[:20]] 
labels = [x[1] for x in zzs[:20]] 
errs = [x[2] for x in zzs[:20]] 
plt.subplots(figsize=(15,10)) 
plt.bar(range(20), imps, color="r", yerr=errs, align="center") 
plt.xticks(range(20), labels, rotation=-70);

Data download: https://files.cnblogs.com/files/blogs/788620/data.zip?t=1713160352&download=true

posted @ 2024-04-15 13:52 PaleKernel