Logo
All Questions

Implement k-means clustering.

DifficultyanalyticalAsked at Amazon, Meta (Facebook), LinkedIn

Question Explain

The question is asking you to implement the k-means clustering algorithm, a type of unsupervised machine learning algorithm. The main idea of the k-means clustering algorithm is to separate data points into k different clusters where each data point belongs to the closest mean ('centroid' of a cluster). The steps to answer this question would be:

  1. Understand the theory behind the k-means clustering algorithm.
  2. Start with an initial set of means and classify each data point by finding the nearest mean.
  3. Determine the new means by computing the average of the data points within each cluster.
  4. Determine if the means impact the classification of any data points. If yes, go back to step 2, otherwise stop.

It's important to have a solid understanding of what k-means clustering means theoretically and practically, as well as familiarity with associated terminologies and parameters.

Answer Example 1

Here is a simple implementation of k-means algorithm in Python:

import numpy as np

def kmeans(data, k, max_iters=100):
    # Initialize centroids randomly
    centroids = data[np.random.choice(range(data.shape[0]), size = k, replace = False)]

    for _ in range(max_iters):
        # Set a cluster to each data point by finding the nearest centroid
        labels = np.argmin(np.linalg.norm(data[:, np.newaxis] - centroids, axis=-1), axis=-1)

        # Compute new centroids from the clusters
        new_centroids = np.array([data[labels==i].mean(axis=0) for i in range(k)])

        # Check for convergence
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return centroids, labels

This code works by initially selecting random data points as the centroids. Then, it iterates over a set maximum number of iterations, during which it assigns every point to the nearest centroid and recalculates the centroid positions. This process repeats until either the maximum number of iterations is reached or the centroids stop changing, signalling convergence.

Note: It's important to contextualise that this is a basic version of k-means. In real applications, you might need to add more functionality such as removing empty clusters or halt if the shift of any centroid is smaller than a given threshold.

Answer Example 2

from sklearn.cluster import KMeans
X = ...  # your data
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
kmeans.labels_

In case your interviewer is looking for how to implement k-means using an already existing machine learning library like scikit-learn, you may use the above implementation. This is especially useful in an interview scenario as it saves time and makes your code much cleaner, but ensure to understand what's happening under the hood as the interviewer may ask about the working mechanism behind it.

More Questions

Question Quick Reference by Category: