13_K-means and Principal Component Analysis
Carpe Tu Black Whistle

Clustering

Given an unlabeled dataset, we would like an algorithm that automatically groups the data into coherent subsets.

K-means

K-means is an iterative algorithm that alternates between two steps:

  • First is the cluster assignment step.
  • Second is the move centroid step.

centroid: n. (math.) the geometric center of a figure

The K-means algorithm is by far the most popular and most widely used clustering algorithm.

Algorithm

Input:

  • $K$ (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$ (drop the $x_0 = 1$ convention)

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$
Repeat {
  for $i = 1$ to $m$:
    $c^{(i)} :=$ index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$
  for $k = 1$ to $K$:
    $\mu_k :=$ average (mean) of the points assigned to cluster $k$
}
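
As a concrete illustration, here is a minimal Octave sketch of these two steps (the function name kmeans_simple and all variable names are my own scaffolding, not from the course; X is an m x n data matrix with one example per row, mu is a K x n matrix of initial centroids):

function [c, mu] = kmeans_simple(X, mu, iters)
  % Minimal K-means sketch. X: m x n data; mu: K x n initial centroids.
  m = size(X, 1);
  K = size(mu, 1);
  for t = 1:iters
    % Cluster assignment step: c(i) = index of the closest centroid.
    c = zeros(m, 1);
    for i = 1:m
      dists = sum((mu - X(i, :)).^2, 2);   % squared distance to each centroid
      [~, c(i)] = min(dists);
    end
    % Move centroid step: mu(k,:) = mean of the points assigned to cluster k.
    for k = 1:K
      pts = X(c == k, :);
      if ~isempty(pts)
        mu(k, :) = mean(pts, 1);           % empty clusters are left in place
      end
    end
  end
end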

Note

  1. In the first step, each data point is assigned to one of the $K$ groups according to its distance to each centroid.
  2. The other part (Step 2) is the move centroid step. A cluster centroid may end up with zero points assigned to it. In that case, the most common thing to do is simply to eliminate that cluster, if you can accept ending up with $K-1$ clusters.
  3. Otherwise, randomly reinitialize that centroid.

Optimization Objective

Definitions

$c^{(i)}$ = index of the cluster $(1, 2, \dots, K)$ to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Optimization objective

$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

$$\min_{c^{(1)}, \dots, c^{(m)}, \; \mu_1, \dots, \mu_K} J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$$

This optimization objective is also called the distortion function, or the distortion of the K-means algorithm.
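
The distortion is straightforward to evaluate in Octave (a sketch with my own names; c is the m x 1 vector of cluster assignments):

function J = distortion(X, c, mu)
  % Average squared distance from each example to its assigned centroid.
  m = size(X, 1);
  J = sum(sum((X - mu(c, :)).^2)) / m;
end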


Random Initialization

  • Should have $K < m$.
  • Randomly pick $K$ training examples.
  • Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples.

To avoid K-means getting stuck in a local optimum, the solution is to try multiple random initializations (run it on the order of 100 times):

For $i = 1$ to 100 {

  Randomly initialize K-means.
  Run K-means. Get $c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K$.
  Compute the cost function (distortion) $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$.

}

Pick the clustering that gave the lowest cost $J$, as in the sketch below.
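
Put together in Octave, reusing the kmeans_simple and distortion helpers sketched earlier (the counts 100 and 10 are arbitrary choices of mine):

m = size(X, 1);
bestJ = Inf;
for t = 1:100
  idx = randperm(m);
  mu0 = X(idx(1:K), :);                 % initialize from K random examples
  [c, mu] = kmeans_simple(X, mu0, 10);  % run K-means from this initialization
  J = distortion(X, c, mu);
  if J < bestJ
    bestJ = J;  bestC = c;  bestMu = mu;  % keep the best clustering so far
  end
end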

Choosing K

Elbow method

  • Vary $K$ (starting from 1), compute the distortion $J$ for each value, and choose the turning point of the curve (shaped like a human elbow).
  • In some situations there is no clear elbow.

The elbow method is worth a shot, but don't necessarily have a very high expectation of it working for any particular problem. A code sketch follows below.
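
A possible Octave sketch of the elbow method, again reusing the helpers above (the range 1..10 is arbitrary):

m = size(X, 1);
Js = zeros(10, 1);
for K = 1:10
  idx = randperm(m);
  mu0 = X(idx(1:K), :);
  [c, mu] = kmeans_simple(X, mu0, 10);
  Js(K) = distortion(X, c, mu);
end
plot(1:10, Js, '-o');                % look for the "elbow" in this curve
xlabel('K'); ylabel('distortion J');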

Another way

Think about the purpose for which you are running K-means downstream, and choose the value of $K$ that best serves that later purpose.

Dimensionality Reduction

There are two motivations for Dimensionality Reduction

Data compression

  1. Uses up less computer memory.
  2. Speeds up the running time of learning algorithms.


Visualization

Reduce from n-dimensions to k-dimensions: find $k$ vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.

Principal Component Analysis (PCA)

PCA is by far the most popular dimensionality reduction algorithm. This section presents the problem formulation for PCA.

What PCA does, formally, is try to find a lower-dimensional surface onto which to project the data.

The goal of PCA is to find a direction (a vector $u^{(1)} \in \mathbb{R}^n$, or in general $k$ vectors $u^{(1)}, \dots, u^{(k)}$) onto which to project the data so as to minimize the projection error.

Difference from Linear Regression

  • For linear regression, we sum the squared vertical errors (the distances between the predictions and the actual $y$ values).
  • For PCA, there is no label $y$; we accumulate the squared orthogonal distances from the points to the line.

Algorithm

Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$

Preprocessing (feature scaling/mean normalization):

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales (e.g., house size in square feet vs. number of bedrooms), scale the features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$, where $s_j$ is the standard deviation (or range) of feature $j$.
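
In vectorized Octave, the whole preprocessing step is three lines (a sketch; assumes no zero-variance features):

mu = mean(X);            % 1 x n vector of per-feature means
s  = std(X);             % 1 x n vector of per-feature standard deviations
X  = (X - mu) ./ s;      % mean-normalize and scale every feature at once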

Main process

Reduce the data from $n$ dimensions to $k$ dimensions.
Compute the "covariance matrix":

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} \right) \left( x^{(i)} \right)^{T}$$

Compute the "eigenvectors" of matrix $\Sigma$:

[U, S, V] = svd(Sigma);

The plan B to compute the eigenvectors:

eig(Sigma)

Because $\Sigma$ is symmetric positive semi-definite, svd and eig give the same result here; svd is the more numerically stable routine.

Summary code

Sigma = (1/m) * X' * X;     % covariance matrix; X is m x n, already normalized
[U,S,V] = svd(Sigma);       % columns of U are the eigenvectors of Sigma
Ureduce = U(:, 1:k);        % keep the first k eigenvectors (an n x k matrix)
z = Ureduce'*x;             % project one example x (n x 1) down to z (k x 1)
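
Note that z = Ureduce'*x compresses one example at a time (x is n x 1, z is k x 1). To compress a whole normalized data matrix in one shot, the equivalent row-wise form is Z = X * Ureduce (my notation), giving an m x k matrix with one compressed example per row.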

Reconstruction from PCA

Given the compressed representation $z \in \mathbb{R}^k$, we can map back to an approximation of the original point: $x_{\text{approx}} = U_{\text{reduce}} \, z$, with $x_{\text{approx}} \approx x$.
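
In code, the reconstruction is a single matrix product (continuing the variable names from the summary code above):

x_approx = Ureduce * z;   % n x 1 approximation of the original example x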

Choosing k (the number of principal components)

Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2$

Typically, choose $k$ to be the smallest value so that

$$\frac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x_{\text{approx}}^{(i)} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01$$

"99% of variance is retained"


[U, S, V] = svd(Sigma)

Pick the smallest value of $k$ for which

$$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$$

(99% of variance is retained)
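
This check is easy to express in Octave from the diagonal of S (a sketch; the variable names are mine):

sv = diag(S);                      % singular values S11, ..., Snn
retained = cumsum(sv) / sum(sv);   % fraction of variance retained for each k
k = find(retained >= 0.99, 1);     % smallest k with >= 99% variance retained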

Advice

Note:

The mapping $x \mapsto z$ should be defined by running PCA only on the training set. This same mapping can then be applied to the examples $x_{\text{cv}}^{(i)}$ and $x_{\text{test}}^{(i)}$ in the cross-validation and test sets.
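
A sketch of that discipline in Octave, for some chosen k (Xtrain, Xcv, Xtest and the other names are mine): the normalization parameters and Ureduce are fit on the training set only and then reused unchanged.

mu = mean(Xtrain);  s = std(Xtrain);       % fit preprocessing on training data
Xn = (Xtrain - mu) ./ s;
Sigma = (1 / size(Xn, 1)) * (Xn' * Xn);
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = Xn * Ureduce;
Zcv    = ((Xcv   - mu) ./ s) * Ureduce;    % same mapping, never re-fit
Ztest  = ((Xtest - mu) ./ s) * Ureduce;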

Bad use of PCA

To prevent overfitting:
Use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features from $n$ to $k < n$.
Thus, with fewer features, the model is less likely to overfit.

This might work, but it isn't a good way to address overfitting, because PCA throws away information without ever looking at the labels $y$.
Use regularization instead:

$$\min_{\theta} \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$