11_Machine Learning System Design
Carpe Tu Black Whistle

We take building a spam classifier as an example.

The contents below are a little disjointed.

Prioritizing What to work on


Ways to improve the accuracy of the classifier:

  • Collect lots of data (for example, the “honeypot” project; but collecting more data doesn’t always help).
  • Develop sophisticated features, e.g. using email header data in spam emails (a basic feature-vector sketch follows this list).
  • Develop algorithms to process your input in different ways, e.g. recognizing misspellings in spam.
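
As a rough sketch of what building features can look like in code, here is a minimal, illustrative version of a binary word-presence feature vector for an email (the vocabulary list and email text are made up for illustration):

```python
# Sketch: build a binary feature vector x where x[j] = 1 if vocabulary word j
# appears in the email and 0 otherwise. Vocabulary and email are illustrative.
import re

vocabulary = ["buy", "deal", "discount", "now", "meeting", "report"]

def email_features(email_text, vocab):
    words = set(re.findall(r"[a-z]+", email_text.lower()))  # crude tokenization
    return [1 if w in words else 0 for w in vocab]

x = email_features("Buy now!! Huge discount, best deal ever.", vocabulary)
print(x)  # [1, 1, 1, 1, 0, 0]
```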

Error Analysis

  • Start with a simple algorithm that can be implemented quickly, and test it on the cross-validation data (a minimal sketch follows this list).
  • Plot learning curves to decide whether more data, more features, etc. are likely to help.
  • Error analysis: manually examine the examples (in the cross-validation set) that the algorithm misclassifies, and see if you spot any systematic trend in what type of examples it is making errors on.
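
A minimal sketch of this loop, using a tiny made-up dataset and scikit-learn's logistic regression as the "simple algorithm that can be implemented quickly":

```python
# Sketch: train a quick baseline spam classifier, then manually inspect the
# cross-validation examples it gets wrong. The tiny dataset is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

emails = ["buy cheap meds now", "meeting at 10am tomorrow",
          "huge discount click here", "project report attached",
          "win a free prize today", "lunch with the team on friday"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10  # 1 = spam, 0 = non-spam

X = CountVectorizer().fit_transform(emails)
X_train, X_cv, y_train, y_cv, em_train, em_cv = train_test_split(
    X, labels, emails, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Error analysis: print the misclassified cross-validation emails and look
# for systematic patterns (e.g. certain topics, misspellings, URLs).
for email, true_y, pred_y in zip(em_cv, y_cv, clf.predict(X_cv)):
    if pred_y != true_y:
        print(f"true={true_y} pred={pred_y}  {email}")
```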


Numerical Evaluation


It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm’s performance. For example if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not.
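
A sketch of that workflow, reusing the toy `emails`/`labels` from the error-analysis sketch above; the "stemmer" here is a crude stand-in (a real system would use something like the Porter stemmer), and the only point is that each variant is reduced to one error number:

```python
# Sketch: compare two preprocessing variants by a single cross-validation
# error number. Reuses `emails` and `labels` from the earlier sketch; the
# "stemmer" is a crude illustrative stand-in, not a real Porter stemmer.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def crude_stem(text):
    # Illustrative only: strip a few suffixes so fail/fails/failed/failing match.
    return " ".join(w.rstrip("s").removesuffix("ing").removesuffix("ed")
                    for w in text.split())

def cv_error(docs, y):
    X = CountVectorizer().fit_transform(docs)
    accuracy = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    return 1 - accuracy

err_plain = cv_error(emails, labels)
err_stemmed = cv_error([crude_stem(e) for e in emails], labels)
print(f"error without stemming: {err_plain:.3f}, with stemming: {err_stemmed:.3f}")
```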

Error metrics for skewed classes

Skewed Classes

Skewed classes occur when the proportions of the two classes are very unbalanced, e.g. the positive class we care about makes up only a tiny fraction of the data. In that case plain classification accuracy (or error) becomes misleading.
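
A trivial sketch of that failure mode, with an illustrative 0.5% positive rate:

```python
# Sketch: with skewed classes, a "classifier" that always predicts the majority
# class looks excellent under plain accuracy. Numbers are illustrative.
y_true = [1] * 5 + [0] * 995        # only 0.5% positive examples
y_pred = [0] * len(y_true)          # always predict negative, ignoring the input

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")  # 0.995, yet it detects no positives at all
```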

Error Metrics


Precision

(Of all patients for whom we predicted y = 1, what fraction actually has cancer?)

Recall

(Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
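
Writing y = 1 for the rare class we want to detect, both metrics can be expressed in terms of the counts of true/false positives and negatives:

$$
\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}},
\qquad
\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
$$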

Trade-off

Logistic Regression as an Example

If we change the threshold at which the classifier predicts y = 1, the precision (P) and recall (R) values also change: a higher threshold gives higher precision but lower recall, while a lower threshold gives higher recall but lower precision.

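A sketch of this trade-off, reusing the classifier `clf` and cross-validation split from the error-analysis sketch above and sweeping the threshold on its predicted probability:

```python
# Sketch: sweep the decision threshold of a probabilistic classifier; precision
# tends to rise while recall falls (reuses clf, X_cv, y_cv from the earlier
# error-analysis sketch).
from sklearn.metrics import precision_score, recall_score

probs = clf.predict_proba(X_cv)[:, 1]          # estimated P(y = 1 | x)
for threshold in (0.3, 0.5, 0.7, 0.9):
    pred = (probs >= threshold).astype(int)
    p = precision_score(y_cv, pred, zero_division=0)
    r = recall_score(y_cv, pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```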

F1 Score

To compare different precision/recall trade-offs (or different algorithms), we still want a single numerical evaluation metric.
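
A simple average (P + R) / 2 is a poor choice, since it can be gamed by extreme thresholds (for example, always predicting y = 1 gives recall 1). The F1 score is high only when both precision and recall are reasonably high:

$$
F_1 = 2\,\frac{PR}{P + R}
$$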

Data for machine learning

“It’s not who has the best algorithm that wins.
It’s who has the most data.”


Large data rationale

  • Use a learning algorithm with many parameters (low bias), so it can represent complex functions and the training error will be small.
  • Use a very large training set (low variance), so the algorithm is unlikely to overfit and the training error will be close to the test error.

Together these suggest the test error will also be small; a learning-curve sketch follows this list.
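
One way to sanity-check this rationale is the learning curve mentioned earlier: if a flexible (low-bias) model's training and cross-validation errors converge toward a low value as the training set grows, more data is likely to keep helping. A sketch using scikit-learn's learning_curve on an illustrative synthetic dataset:

```python
# Sketch: plot training vs. cross-validation error as the training set grows.
# Converging, low curves suggest a low-bias model where more data helps.
# Dataset and model are illustrative only.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label="cross-validation error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()
```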