We take building a spam classifier as the running example.
The notes below are somewhat disjointed.
Prioritizing What to work on
Improve the accuracy of the classifier
- Collect lots of data (for example, the “honeypot” project, although more data does not always help)
- Develop sophisticated features (for example, using email header data from spam emails)
- Develop algorithms that process your input in different ways (for example, recognizing misspellings in spam)
Error Analysis
Recommended approach
- Start with a simple algorithm that can be implemented quickly, and test it on the cross-validation data.
- Plot learning curves to decide whether more data, more features, etc. are likely to help.
- Error analysis: manually examine the examples (in the cross-validation set) that the algorithm got wrong. See if you can spot any systematic trend in the types of examples it makes errors on.
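The error-analysis step can be as simple as tallying hand-assigned categories for the misclassified cross-validation emails. The sketch below assumes a reviewer has already labeled each misclassified email; the category names and counts are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical hand-assigned categories, one per misclassified
# cross-validation email (these values are made up for illustration).
misclassified_categories = [
    "pharma", "phishing", "replica", "phishing", "pharma",
    "phishing", "other", "phishing", "replica", "phishing",
]


def tally_errors(categories):
    """Count misclassified examples per category, most frequent first."""
    return Counter(categories).most_common()


error_tally = tally_errors(misclassified_categories)
for category, count in error_tally:
    print(f"{category}: {count}")
```

The category with the largest count is a reasonable candidate for where to spend effort next, for example by designing features aimed at that type of email.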
Error Analysis
Numerical Evaluation
It is very important to get error results as a single numerical value. Otherwise it is difficult to assess your algorithm’s performance. For example, if we use stemming, which is the process of treating different forms of the same word (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper-case and lower-case letters and end up with a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for the error rate, and decide based on that result whether to keep the new feature.
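A minimal sketch of this keep-or-discard decision follows. The helper names `error_rate` and `keep_change` and the toy predictions are assumptions made for illustration. The rule used here (keep a change only if the cross-validation error goes down) is one simple reading of the paragraph above.

```python
def error_rate(predictions, labels):
    """Fraction of examples where the prediction differs from the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def keep_change(baseline_error, new_error):
    """Keep a change only if it lowers the cross-validation error."""
    return new_error < baseline_error


# Toy cross-validation labels and two hypothetical sets of predictions.
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # 2 mistakes
stemming_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # 1 mistake

baseline_error = error_rate(baseline_pred, labels)
stemming_error = error_rate(stemming_pred, labels)
print(f"baseline: {baseline_error:.2f}, with stemming: {stemming_error:.2f}")
print("keep stemming:", keep_change(baseline_error, stemming_error))

# The error rates quoted in the text give the same decisions:
print("keep stemming (5% -> 3%):", keep_change(0.05, 0.03))
print("keep case sensitivity (3% -> 3.2%):", keep_change(0.03, 0.032))
```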
Error metrics for skewed classes
Skewed Classes
Classes are skewed when one class makes up a far larger proportion of the examples than the other.
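To see why plain accuracy is misleading on skewed classes, consider a toy data set (the 5-in-1000 split is an assumption chosen for illustration) and a "classifier" that always predicts the majority class:

```python
# Assumed toy data: 1000 examples, only 5 of which are positive (y = 1).
labels = [1] * 5 + [0] * 995

# A "classifier" that ignores its input and always predicts y = 0.
predictions = [0] * len(labels)

correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = correct / len(labels)
positives_found = sum(
    1 for p, y in zip(predictions, labels) if p == 1 and y == 1
)

print(f"accuracy = {accuracy:.3f}")
print(f"positives detected = {positives_found} of {sum(labels)}")
```

The accuracy looks excellent even though not a single positive example is detected, which is why the metrics in the next section are needed.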
Error Metrics
Precision
(Of all patients where we predicted y=1 what fraction actually has cancer?)
Recall
(Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
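In terms of counts, precision is true positives divided by all predicted positives, TP / (TP + FP), and recall is true positives divided by all actual positives, TP / (TP + FN). A minimal sketch follows. The function name `precision_recall` and the toy data are assumptions, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def precision_recall(predictions, labels):
    """Return (precision, recall) for binary predictions, positive class = 1."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Toy example: 3 patients actually have cancer, 2 are predicted positive.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(predictions, labels)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

With these definitions, a classifier that always predicts y = 0 gets a recall of 0, so it can no longer hide behind a high accuracy.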
Trade off
Logistic Regression as Example
If the threshold changes, the precision (P) and recall (R) values also change.
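The trade-off can be sketched by sweeping the decision threshold over predicted probabilities. The probabilities and labels below are hypothetical stand-ins for the output of a logistic regression model, and predicting y = 1 when the probability is at least the threshold is an assumed convention.

```python
def precision_recall_at(probabilities, labels, threshold):
    """Precision and recall when predicting 1 if probability >= threshold."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical model outputs h(x) and the true labels for 8 examples.
probabilities = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

results = []
for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(probabilities, labels, threshold)
    results.append((threshold, p, r))
    print(f"threshold = {threshold}: precision = {p:.2f}, recall = {r:.2f}")
```

On this toy data a higher threshold gives higher precision and lower recall. That is the usual tendency, although precision is not guaranteed to rise at every step on real data.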
F1 score
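The F1 score combines precision P and recall R into a single number, F1 = 2PR / (P + R), so that classifiers or thresholds can be compared with one value. In the sketch below the three candidate algorithms and their precision/recall values are hypothetical, and returning 0.0 when P + R = 0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for three candidate algorithms.
candidates = {
    "algorithm_1": (0.5, 0.4),
    "algorithm_2": (0.7, 0.1),
    "algorithm_3": (0.02, 1.0),
}

scores = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")

best = max(scores, key=scores.get)
print("best by F1:", best)
```

Note that a plain average of P and R would rank the third candidate highest (0.51), even though its precision is nearly zero. F1 stays low unless both values are reasonably high.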
Data for machine learning
“It’s not who has the best algorithm that wins.
It’s who has the most data.”
Large data rationale
- Use a learning algorithm with many parameters
- Use a very large training set (unlikely to overfit)
- Post title: 11_Machine Learning System Design
- Create time: 2022-02-12 08:32:48