Machine_Learning_in_Action00 - Introduction
Machine Learning in Action Introduction
by Peter Harrington
Contents
- Classification
  - Machine learning basics
  - Classifying with k-Nearest Neighbors
  - Splitting datasets one feature at a time: decision trees
  - Classifying with probability theory: naive Bayes
  - Logistic regression
  - Support vector machines (SVM)
  - Improving classification with the AdaBoost meta-algorithm
- Forecasting numeric values with regression
  - Predicting numeric values: regression
  - Tree-based regression
- Unsupervised learning
  - Grouping unlabeled items using k-means clustering
  - Association analysis with the Apriori algorithm
  - Efficiently finding frequent itemsets with FP-growth
- Additional tools
  - Using principal component analysis to simplify data
  - Simplifying data with the singular value decomposition
  - Big data and MapReduce
Preface
After college I went to work for Intel in California and mainland China. Originally my plan was to go back to grad school after two years, but time flies when you are having fun, and two years turned into six. I realized I had to go back at that point, and I didn’t want to do night school or online learning; I wanted to sit on campus and soak up everything a university has to offer. The best part of college is not the classes you take or research you do, but the peripheral things: meeting people, going to seminars, joining organizations, dropping in on classes, and learning what you don’t know.
Sometime in 2008 I was helping set up for a career fair. I began to talk to someone from a large financial institution and they wanted me to interview for a position modeling credit risk (figuring out if someone is going to pay off their loans or not). They asked me how much stochastic calculus I knew. At the time, I wasn’t sure I knew what the word stochastic meant. They were hiring for a geographic location my body couldn’t tolerate, so I decided not to pursue it any further. But this stochastic stuff interested me, so I went to the course catalog and looked for any class being offered with the word “stochastic” in its title. The class I found was “Discrete-time Stochastic Systems.” I started attending the class without registering, doing the homework and taking tests. Eventually I was noticed by the professor and she was kind enough to let me continue, for which I am very grateful. This class was the first time I saw probability applied to an algorithm. I had seen algorithms take an averaged value as input before, but this was different: the variance and mean were internal values in these algorithms. The course was about “time series” data where every piece of data is a regularly spaced sample. I found another course with Machine Learning in the title. In this class the data was not assumed to be uniformly spaced in time, and they covered more algorithms but with less rigor. I later realized that similar methods were also being taught in the economics, electrical engineering, and computer science departments.
In early 2009, I graduated and moved to Silicon Valley to start work as a software consultant. Over the next two years, I worked with eight companies on a very wide range of technologies and saw two trends emerge which make up the major thesis for this book: first, in order to develop a compelling application you need to do more than just connect data sources; and second, employers want people who understand theory and can also program.
A large portion of a programmer’s job can be compared to the concept of connecting pipes—except that instead of pipes, programmers connect the flow of data—and monstrous fortunes have been made doing exactly that. Let me give you an example. You could make an application that sells things online—the big picture for this would be allowing people a way to post things and to view what others have posted. To do this you could create a web form that allows users to enter data about what they are selling and then this data would be shipped off to a data store. In order for other users to see what a user is selling, you would have to ship the data out of the data store and display it appropriately. I’m sure people will continue to make money this way; however to make the application really good you need to add a level of intelligence. This intelligence could do things like automatically remove inappropriate postings, detect fraudulent transactions, direct users to things they might like, and forecast site traffic. To accomplish these objectives, you would need to apply machine learning. The end user would not know that there is magic going on behind the scenes; to them your application “just works,” which is the hallmark of a well-built product.
An organization may choose to hire a group of theoretical people, or “thinkers,” and a set of practical people, “doers.” The thinkers may have spent a lot of time in academia, and their day-to-day job may be pulling ideas from papers and modeling them with very high-level tools or mathematics. The doers interface with the real world by writing the code and dealing with the imperfections of a non-ideal world, such as machines that break down or noisy data. Separating thinkers from doers is a bad idea and successful organizations realize this. (One of the tenets of lean manufacturing is for the thinkers to get their hands dirty with actual doing.) When there is a limited amount of money to be spent on hiring, who will get hired more readily—the thinker or the doer? Probably the doer, but in reality employers want both. Things need to get built, but when applications call for more demanding algorithms it is useful to have someone who can read papers, pull out the idea, implement it in real code, and iterate.
I didn’t see a book that addressed the problem of bridging the gap between thinkers and doers in the context of machine learning algorithms. The goal of this book is to fill that void, and, along the way, to introduce uses of machine learning algorithms so that the reader can build better applications.
Algorithms
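In December 2007, a paper presented at an IEEE international conference, Top 10 Algorithms in Data Mining, summarized ten commonly used algorithms.
Eight of them make up this book. The other two are PageRank, Google’s search ranking algorithm, and Expectation Maximization. PageRank is already covered by plenty of material elsewhere, and Expectation Maximization leans too heavily on mathematics.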