Assumption that we know how many clusters there are as a prior ( in K-Means). Designed for vector quantization: replacing examples with the mean of their cluster (collapsing a bunch of examples of a class down to a single example)
Can also be seen as a really bad latent-factor model
K-means partitions the space into convex regions, but clusters in the data might not be convex
Minimize
- Pick some
- Assign a cluster to different points randomly
- Iterate
- Center update β calculate average for each cluster (using euclidian distance)
- Label update β re-assign the data to the closest cluster center
- If no labels changed, finish (model has converged)
Warning: the clustering is initialization dependent and converges to a local minimum. Often requires some amount of random runs to approximate a good solution, pick best one.
Limited to compact/spherical clusters in high-dimensions (which is poor for modeling clusters with the same mean but different distributions)
Advantages
- easy to implement and interpret
- simple to understand
- computationally more efficient than other clustering algorithms
Disadvantages
- need to specify K
- dependent on initialization
- sensitive to scale of features (need to normalize/standardize)
Cost
- Dominated by calculating distance from each to each mean
K-means, unlike the classification and regression models we studied in previous chapters, can get βstuckβ in a bad solution. For example, if we were unlucky and initialized K-means with the following labels. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD.