[Translated] A Brief History of Machine Learning
Translated from the original blog post.
Here is the timeline image compiled by the original author:

AI is one of the hottest subjects in academia and industry today, and machine learning is an important branch of it. Companies and universities devote a lot of resources to advancing knowledge of the field. Recent progress has produced results on many tasks that are very reliable compared with humans; for instance, in a traffic-sign recognition competition the winning accuracy reached 98.98%, above human performance.
Here I would like to share a rough timeline of machine learning and mark some of its milestones, by no means completely. In addition, you should prepend "to the best of my knowledge" to any argument you raise in the comments.
As it is generally told, machine learning began with Hebb in 1949, based on a neuropsychological learning formulation since called Hebbian learning theory. In simple terms, it pursues correlations between the nodes of a recurrent neural network (RNN): the network memorizes any commonalities among its nodes and later serves as a kind of memory. The argument is formally stated as follows:
Let us assume that the persistence or repetition of a reverberatory activity (or "trace") tends to induce lasting cellular changes that add to its stability.… When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.[1]
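As a minimal sketch of the idea (not Hebb's own formulation; the learning rate and names here are illustrative), the Hebbian rule strengthens a connection in proportion to the co-activation of the two nodes it joins:

```python
import numpy as np

def hebbian_update(W, x, y, eta=0.1):
    """Hebbian rule: "cells that fire together wire together".
    W grows wherever pre-synaptic x and post-synaptic y co-activate."""
    return W + eta * np.outer(y, x)

x = np.array([1.0, 0.0, 1.0])      # pre-synaptic activity
y = np.array([1.0, 1.0])           # post-synaptic activity
W = np.zeros((2, 3))
W = hebbian_update(W, x, y)
print(W)  # nonzero only where x and y were both active
```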
In 1952, Arthur Samuel at IBM developed a program that played checkers. The program could observe board positions and learn an implicit model that suggested better subsequent moves. Samuel had the program play many games and found that it played better over time.
With this program Samuel refuted the then-common argument that machines cannot go beyond their hand-written code and learn patterns the way humans do. He coined the term "machine learning", defining it as:
a field of study that gives computers the ability to learn without being explicitly programmed.
In 1957, Rosenblatt proposed the perceptron, the second model to come out of a neuroscientific background, and one much more similar to today's ML models. It was a very exciting discovery at the time, and more applicable than the Hebbian idea. Rosenblatt introduced the perceptron with these words:
The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms.[2]
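As a minimal sketch of what the model computes (names illustrative), the perceptron fires when a weighted sum of its inputs crosses a threshold:

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Fire (1) if the weighted input sum crosses the threshold, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0
```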
Four years later, Widrow and Hoff [4] framed the delta learning rule, which was then used as the practical procedure for training perceptrons; it is in essence a least-squares problem. The combination of the two ideas yields a good linear classifier. The excitement did not last long, however: Minsky [3] killed it off in 1969 by posing the famous XOR problem, a linearly non-separable data distribution that the perceptron simply cannot handle. It was Minsky's chill over the NN community. Thereafter, NN research hibernated until the 1980s.
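A rough sketch under illustrative hyperparameters: a single linear unit trained with the delta (least-squares) rule, run on XOR, never separates the classes, which is exactly Minsky's point:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])     # XOR: not linearly separable

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(200):
    for xi, ti in zip(X, y):
        err = ti - (w @ xi + b)        # delta rule: squared-error gradient
        w += eta * err * xi
        b += eta * err

print([(1 if w @ xi + b > 0.5 else 0) for xi in X])
# no weight setting can produce [0, 1, 1, 0] with a single linear unit
```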
There was little progress in NN research until Werbos [6] proposed the multi-layer perceptron (MLP) trained with the backpropagation (BP) algorithm in 1981, even though BP itself had been proposed by Linnainmaa [5] back in 1970 under the name "reverse mode of automatic differentiation" [quite a long wait — translator's note]. BP remains a key ingredient of today's NN architectures. With these fresh ideas, NN research accelerated again, and in 1985-1986 the idea of MLPs practically trained with BP was realized in quick succession (Rumelhart, Hinton, Williams [7]; Hecht-Nielsen [8]).
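A minimal sketch of an MLP trained with BP on XOR, assuming sigmoid units and a squared-error loss (layer sizes, learning rate, and iteration count are illustrative); unlike the single linear unit above, one hidden layer handles XOR:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])     # XOR targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                   # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)        # backprop: output layer
    d_h = (d_out @ W2.T) * h * (1 - h)         # backprop: hidden layer
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```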
On a separate track, a very famous ML algorithm was proposed by J. R. Quinlan [9] in 1986: decision trees, concretely the ID3 algorithm. It was the spark point of another mainline of ML. Moreover, ID3 was released as software and, thanks to its simple rules and clear inferences, found real-life uses, in contrast to NN models, which remain black boxes to this day.
After ID3, many alternatives and improved versions were proposed by the community (e.g. ID4, regression trees, CART), and decision trees are still an active topic in ML.
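As a sketch of ID3's core step (function names illustrative, not Quinlan's code): at each node the algorithm picks the attribute whose split most reduces label entropy:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on `attr`; ID3 greedily
    chooses the attribute with the largest gain at each node."""
    gain, n = entropy(labels), len(labels)
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

rows = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
print(information_gain(rows, ["no", "yes", "yes"], "outlook"))  # ~0.918
```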

One of the most important breakthroughs in ML was the support vector machine (SVM), proposed by Vapnik and Cortes [10] in 1995, with very strong theoretical standing and plain, accurate results. That was the moment the ML community split into an NN camp and an SVM camp. Once kernelized versions of SVM appeared near 2000, however, the NN camp could hardly compete: SVMs beat NN models on most tasks. In addition, whereas NN had little such leverage, SVM could exploit all the profound machinery of convex optimization, generalization margin theory, and kernel methods; drawing this great push from several disciplines, it achieved major theoretical and practical improvements.
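As a rough flavor using scikit-learn's `SVC` (a modern library sketch, not the 1995 formulation; the parameters are illustrative), a kernelized SVM handles exactly the non-linear cases that stumped linear models:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])   # XOR once more

# The RBF kernel implicitly maps inputs to a space where a separating
# max-margin hyperplane exists: the "kernel trick".
clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(clf.predict(X))        # expected: [0 1 1 0]
```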

NN research took further heavy blows from Hochreiter's 1991 thesis [40] and a 2001 paper by Hochreiter et al. These works showed that the gradient is lost as NN layers saturate under BP training. Simply put, after a certain number of iterations, further training is redundant because the saturated layers pass back almost no gradient; NNs also tend to overfit within a few epochs.
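A small numeric sketch of the saturation effect (depth and pre-activation value illustrative): each saturated sigmoid layer multiplies the backpropagated gradient by a derivative well below 1, so it decays geometrically with depth:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

grad, z = 1.0, 2.5           # z in the saturated region of the sigmoid
for layer in range(10):
    grad *= sigmoid(z) * (1 - sigmoid(z))   # sigmoid'(z) <= 0.25 always
print(grad)                  # on the order of 1e-12 after just 10 layers
```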
A little earlier, in 1997, Freund and Schapire proposed a robust ML model called AdaBoost, which boosts an ensemble of weak classifiers; the work won its authors the Gödel Prize. AdaBoost trains individually simple, weak classifiers and gives the hardest training instances higher weights. The model is still the basis of many tasks such as face detection and recognition, and it is also a realization of PAC (Probably Approximately Correct) learning theory. In general, the weak classifiers used are decision stumps (single-node decision trees). The authors introduced AdaBoost as follows:
The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting...[11]
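A rough sketch of one AdaBoost round (a textbook rendering, not the authors' code; it assumes labels in {-1, +1} and a weak learner with error strictly between 0 and 1):

```python
import numpy as np

def adaboost_round(sample_w, y_true, y_pred):
    """Compute the weak learner's vote alpha, then reweight samples so
    the ones it misclassified count more in the next round."""
    err = sample_w[y_true != y_pred].sum() / sample_w.sum()
    alpha = 0.5 * np.log((1 - err) / err)          # weak learner's vote
    sample_w = sample_w * np.exp(-alpha * y_true * y_pred)
    return alpha, sample_w / sample_w.sum()        # renormalize

w = np.full(4, 0.25)                               # uniform start
alpha, w = adaboost_round(w, np.array([1, 1, -1, -1]),
                             np.array([1, 1, -1, 1]))
print(alpha, w)  # the misclassified last sample now carries more weight
```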
Another ensemble method was proposed by Breiman [12] in 2001, in which each weak decision tree is trained on a random subset of the samples and a random subset of the features. True to its nature, it is called a random forest (RF). RF has both theoretical and empirical evidence that it resists overfitting, whereas AdaBoost has the weakness of overfitting and of being affected by outliers in the data. RF has also been quite successful on many Kaggle competition tasks.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large[12]
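A quick scikit-learn sketch (modern library usage, not Breiman's code; the dataset and parameters are illustrative) showing the two sources of randomness, bootstrapped rows and per-split feature subsets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier  # assumes scikit-learn

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True resamples rows per tree; max_features="sqrt" draws a
# random feature subset at every split, decorrelating the trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)
print(rf.score(X, y))
```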
Today we are in a new era of NN called deep learning. The phrase refers to NN models built from many successive wide layers. This third rise of NN began around 2005 with the conjunction of past and present work by luminaries such as Hinton, LeCun, Bengio, Andrew Ng and other valuable researchers. Some of the important headings are listed here:
- GPU programming
- Convolutional networks [18][20][40]
  - Deconvolutional networks [21]
- Optimization algorithms
  - Stochastic gradient descent [19][22]
  - BFGS and L-BFGS [23]
  - Conjugate gradient descent [24]
  - Backpropagation [40][19]
- Rectifier units (ReLU; sketched after this list)
- Sparsity [15][16]
- Dropout networks [26] (also sketched after this list)
- Maxout networks [25]
- Unsupervised NN training [14]
  - Deep belief networks [13]
  - Stacked auto-encoders [16][39]
  - Denoising autoencoders [17]
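To make two of the listed items concrete, here is a minimal NumPy sketch of a rectifier unit and inverted dropout (shapes, rate, and names are illustrative, and this is one common variant rather than the exact formulation of [26]):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Rectifier unit: pass positive inputs through, zero out the rest."""
    return np.maximum(0.0, z)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: randomly silence units while training, rescaling
    survivors so expected activations match the test-time network."""
    if not training:
        return h
    mask = rng.random(h.shape) > p
    return h * mask / (1.0 - p)

x = rng.normal(size=(8, 32))            # a toy input batch
W = 0.1 * rng.normal(size=(32, 64))
h = dropout(relu(x @ W))                # one hidden layer, DL-style
print(h.shape)                          # (8, 64)
```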
Combining all these ideas and others not listed, NN models beat the state of the art on many tasks, such as object recognition, speech recognition, and NLP. It should be noted, though, that this is not the end of the other ML camps. Despite deep learning's great success, there is plenty of criticism aimed at its training cost and the large number of hyperparameters that must be tuned externally. Besides, SVM is still used more commonly owing to its simplicity (saying so may spark a huge debate, ha).
Before closing, I need to mention one more emerging trend in ML. With the rise of the World Wide Web and social media, a new term, big data, has emerged and influenced ML research deeply. Faced with data this large, many strong ML algorithms became useless for small companies and institutions. Researchers therefore came up with a new family of simple models dubbed bandit algorithms [27-38] (previously framed as online learning), which make learning easier and adapt very well to large-scale problems.
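For flavor, here is a minimal epsilon-greedy bandit sketch (a standard textbook strategy rather than the method of any one reference above; arm count, epsilon, and reward rates are illustrative), which learns from one observation at a time, exactly the online regime big data demands:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]     # hidden reward rate of each arm
counts = np.zeros(3)
values = np.zeros(3)             # running mean reward per arm

for t in range(1000):
    # explore a random arm with prob. 0.1, else exploit the best so far
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(values))
    reward = float(rng.random() < true_means[arm])       # Bernoulli draw
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(int(np.argmax(values)))    # usually identifies the best arm (2)
```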
References
[1] Hebb, D. O. The Organization of Behavior. New York: Wiley & Sons, 1949.
[2] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65.6 (1958): 386.
[3] Minsky, Marvin, and Seymour Papert. "Perceptrons." (1969).
[4] Widrow, B., and M. E. Hoff. "Adaptive switching circuits." (1960): 96-104.
[5] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.
[6] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8-4.9, NYC, pages 762-770, 1981.
[7] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE SCIENCE, 1985.
[8] Hecht-Nielsen, Robert. "Theory of the backpropagation neural network." Neural Networks, 1989. IJCNN., International Joint Conference on. IEEE, 1989.
[9] Quinlan, J. Ross. "Induction of decision trees." Machine learning 1.1 (1986): 81-106.
[10] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.
[11] Freund, Yoav, Robert Schapire, and N. Abe. "A short introduction to boosting." Journal-Japanese Society For Artificial Intelligence 14.771-780 (1999): 1612.
[12] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[13] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554.
[14] Bengio, Lamblin, Popovici, Larochelle. "Greedy Layer-Wise Training of Deep Networks." NIPS 2006.
[15] Ranzato, Poultney, Chopra, LeCun. "Efficient Learning of Sparse Representations with an Energy-Based Model." NIPS 2006.
[16] Olshausen, B. A., and Field, D. J. "Sparse coding with an overcomplete basis set: a strategy employed by V1?" Vision Research 37.23 (1997): 3311-25. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9425546.
[17] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. "Extracting and Composing Robust Features with Denoising Autoencoders." Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML '08), pages 1096-1103, ACM, 2008.
[18] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
[19] LeCun, Yann, et al. "Gradient-based learning applied to document recognition."Proceedings of the IEEE 86.11 (1998): 2278-2324.
[20] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The handbook of brain theory and neural networks 3361 (1995).
[21] Zeiler, Matthew D., et al. "Deconvolutional networks." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[22] S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In International Conference on Machine Learning (ICML '06), 2006.
[23] Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage." Mathematics of Computation 35 (151): 773-782. doi:10.1090/S0025-5718-1980-0572855-
[24] S. Yun and K.-C. Toh, “A coordinate gradient descent method for l1- regularized convex minimization,” Computational Optimizations and Applications, vol. 48, no. 2, pp. 273–307, 2011.
[25] Goodfellow, I., Warde-Farley, D., et al. "Maxout networks." arXiv preprint arXiv:1302.4389, 2013. Available at: http://arxiv.org/abs/1302.4389. Accessed March 20, 2014.
[26] Wan, L., Zeiler, M., et al. "Regularization of neural networks using DropConnect." ICML 2013. Available at: http://machinelearning.wustl.edu/mlpapers/papers/icml2013_wan13. Accessed March 13, 2014.
[27] Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford, A Reliable Effective Terascale Linear Learning System, 2011
[28] M. Hoffman, D. Blei, F. Bach, Online Learning for Latent Dirichlet Allocation, in Neural Information Processing Systems (NIPS) 2010.
[29] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang Agnostic Active Learning Without Constraints NIPS 2010.
[30] John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR 2011 & COLT 2010.
[31] H. Brendan McMahan, Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
[32] Nikos Karampatziakis and John Langford, Importance Weight Aware Gradient Updates UAI 2010.
[33] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.
[34] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan, Hash Kernels for Structured Data, AISTAT 2009.
[35] John Langford, Lihong Li, and Tong Zhang, Sparse Online Learning via Truncated Gradient, NIPS 2008.
[36] Leon Bottou, Stochastic Gradient Descent, 2007.
[37] Avrim Blum, Adam Kalai, and John Langford Beating the Holdout: Bounds for KFold and Progressive Cross-Validation. COLT99 pages 203-208.
[38] Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
[39] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
[40] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
posted on 2018-03-01 20:37 by BPassionate