# 7 Machine Learning System Design

Content

7 Machine Learning System Design

7.1 Prioritizing What to Work On

7.2 Error Analysis

7.3 Error Metrics for Skewed Classes

7.3.1 Precision/Recall

7.3.2 Trading off precision and recall: F1 Score

7.4 Data for machine learning

## 7.1 Prioritizing What to Work On

When building a spam classifier, there are several ways to spend your time to lower its error:

• Collect lots of data, e.g. the "honeypot" project.
• Develop sophisticated features based on email routing information (from the email header).
• Develop sophisticated features for the message body, e.g. should "discount" and "discounts" be treated as the same word? How about "deal" and "Dealer"? Features about punctuation?
• Develop a sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches).
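To make the message-body and misspelling ideas above concrete, here is a minimal Python sketch. The substitution table `DIGIT_SWAPS` and the helpers `undo_digit_swaps` and `crude_stem` are hypothetical names invented for this illustration, not part of the course material, and a real system would likely use a proper stemmer and a far more careful misspelling detector.

```python
# Hypothetical table of digits that spammers often use in place of letters.
DIGIT_SWAPS = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s"})


def undo_digit_swaps(token: str) -> str:
    """Lowercase a token and map look-alike digits back to letters."""
    return token.lower().translate(DIGIT_SWAPS)


def crude_stem(token: str) -> str:
    """Lowercase a token and drop a single trailing plural 's' (very crude)."""
    word = token.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


for raw in ["m0rtgage", "med1cine", "w4tches"]:
    print(raw, "->", undo_digit_swaps(raw))

# "discount" and "discounts" collapse to one feature; "deal" and "Dealer" do not.
print(crude_stem("discounts") == crude_stem("discount"))
print(crude_stem("Dealer") == crude_stem("deal"))
```

Note that the digit mapping would also mangle genuine numbers in an email, which is one reason whether such a feature helps is best decided by testing it on the cross-validation set.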

## 7.2 Error Analysis

1. Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
2. Plot learning curves to decide if more data, more features, etc. are likely to help.
3. Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in the types of examples it gets wrong.
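Step 3 often comes down to counting. As a rough sketch, assume each misclassified cross-validation example has been given a category label by hand; the labels and the `tally_errors` helper below are made up for illustration. Tallying them shows where the errors concentrate:

```python
from collections import Counter

# Hypothetical hand-assigned categories for misclassified cross-validation emails.
misclassified = [
    "pharma", "phishing", "replica", "phishing",
    "other", "phishing", "pharma",
]


def tally_errors(labels):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(labels).most_common()


for category, count in tally_errors(misclassified):
    print(f"{category}: {count}")
```

The category with the largest count is a natural candidate for new features, since improvements there can remove the most errors.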

## 7.3 Error Metrics for Skewed Classes

### 7.3.1 Precision/Recall

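Precision = true positives / #predicted positive

Recall = true positives / #actual positive

If #predicted positive = 0, precision is undefined, because division by zero is meaningless.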
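The following Python sketch computes both metrics from label lists, using the standard definitions (precision = true positives / #predicted positive, recall = true positives / #actual positive). Returning `None` for an undefined ratio is a choice made for this example; the function name `precision_recall` is likewise just illustrative.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); a value is None when its denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


# Skewed classes: one positive among ten examples, classifier always predicts 0.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_zero = [0] * len(y_true)
print(precision_recall(y_true, always_zero))
```

On this made-up data the always-0 classifier is 90% accurate, yet its recall is 0 and its precision is undefined, which is why accuracy alone is misleading for skewed classes.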

### 7.3.2 Trading off precision and recall: F1 Score

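To predict y = 1 only when very confident, raise the threshold (higher precision, lower recall):

Predict 1 if: h_θ(x) ≥ 0.7

Predict 0 if: h_θ(x) < 0.7

To avoid missing too many positive cases, lower the threshold (higher recall, lower precision):

Predict 1 if: h_θ(x) ≥ 0.3

Predict 0 if: h_θ(x) < 0.3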
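The effect of the two thresholds above can be seen on a small example. The scores and labels below are invented for illustration, and `predict` and `precision_recall` are hypothetical helper names; the sketch assumes the classifier outputs a probability-like score for each example.

```python
def predict(scores, threshold):
    """Predict 1 when the score is at least the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def precision_recall(y_true, y_pred):
    """Return (precision, recall); a value is None when its denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


# Hypothetical classifier scores and true labels.
scores = [0.95, 0.80, 0.65, 0.40, 0.35, 0.10]
y_true = [1, 1, 0, 1, 0, 0]

for threshold in (0.7, 0.3):
    p, r = precision_recall(y_true, predict(scores, threshold))
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this data the 0.7 threshold gives the higher precision and the 0.3 threshold gives the higher recall, which is the trade-off this section describes.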

1 / F1 = (1 / P + 1 / R) / 2

F1 = 2PR / (P + R)

• P = 0, R = 1: F1 = 0
• P = 1, R = 0: F1 = 0
• P = 1, R = 1: F1 = 1
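A short Python sketch of the formula F1 = 2PR / (P + R) reproduces these boundary cases. Returning 0 when P + R = 0 is a convention assumed here to avoid dividing by zero, and the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoids 0 / 0 when both are zero
    return 2 * precision * recall / (precision + recall)


for p, r in [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print(f"P={p}, R={r}: F1={f1_score(p, r)}")

# A classifier with very low precision and perfect recall: the simple average
# looks acceptable, while F1 stays close to the smaller of the two values.
p, r = 0.02, 1.0
print("average:", (p + r) / 2)
print("F1:", f1_score(p, r))
```

Because F1 is small whenever either P or R is small, it is a more useful single number than the plain average for comparing thresholds or algorithms.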

## 7.4 Data for machine learning

"It's not who has the best algorithm that wins. It's who has the most data."

(I suppose this is the appeal of big data.)

• Large data rationale

posted @ 2016-04-18 00:06 llhthinker