Reducing Dimensionality with Cluster PAM While Keeping Columns Available for Future Reference

Cluster PAM in R - How to Ignore a Column/Variable but Still Keep it

K-means is a clustering algorithm that partitions observations into k groups. Each observation is assigned to the cluster with the nearest centroid, where a centroid is the mean of the observations currently in that cluster. Because a centroid is an average, it usually does not coincide with any actual observation, and it can be pulled toward outliers.

The PAM (Partitioning Around Medoids) algorithm is a closely related method, often called k-medoids, that uses medoids instead of centroids as the reference points for cluster assignment. In this algorithm, a “medoid” is an actual observation within a cluster that is representative of all other objects in the cluster.

What are Medoids?

A medoid is an object within a cluster that represents the cluster as a whole. PAM was proposed by Leonard Kaufman and Peter Rousseeuw, whose work the R cluster package is based on. A medoid is similar to a centroid, but instead of averaging each dimension, it is the cluster member whose total dissimilarity (the sum of its distances) to all other members of the cluster is smallest. Because it is a real observation, a medoid can be found from any dissimilarity measure and is less sensitive to outliers than a mean.
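To make the difference concrete, the following minimal sketch computes both a centroid and a medoid for a handful of random points using base R only. The data and the variable names (pts, medoid_idx and so on) are illustrative assumptions, not part of any package.

```r
# Ten random 2-D points (illustrative data only).
set.seed(1)
pts <- matrix(rnorm(20), ncol = 2)

# Centroid: the per-column mean; usually not one of the points.
centroid <- colMeans(pts)

# Medoid: the point with the smallest total distance to all other points.
d <- as.matrix(dist(pts))            # pairwise Euclidean distances
medoid_idx <- which.min(rowSums(d))
medoid <- pts[medoid_idx, ]

print(centroid)
print(medoid)
```

The medoid is always one of the rows of pts, whereas the centroid generally is not.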

The PAM algorithm assigns every data point to the cluster of its closest “medoid”. Data points are never moved; what changes during the search is which observations serve as medoids, with the goal of lowering the total dissimilarity between points and their medoids.

How does Cluster PAM work?

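Cluster PAM follows the same assign-and-update pattern as K-Means but works with medoids instead of centroids. The basic steps involved in Cluster PAM are:

Select k observations as the initial medoids (by default pam() uses a greedy "build" phase for this rather than random starts).
Assign each data point to the closest medoid.
For each medoid, try swapping it with a non-medoid observation and keep the swap if it lowers the total dissimilarity between points and their medoids.
Repeat the assignment and swap steps until no swap improves the result.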
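Two properties of a finished PAM run can be checked directly in R: the medoids are actual rows of the input, and each observation sits in the cluster of its nearest medoid. The sketch below uses random illustrative data; id.med, medoids and clustering are components of the object returned by pam().

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(40), ncol = 2)   # 20 illustrative observations
fit <- pam(x, k = 3)

# Medoids are actual rows of x; id.med holds their row indices.
medoid_rows <- x[fit$id.med, , drop = FALSE]
print(all.equal(fit$medoids, medoid_rows, check.attributes = FALSE))

# Each observation belongs to the cluster of its nearest medoid.
d <- as.matrix(dist(x))
nearest_medoid <- fit$id.med[apply(d[, fit$id.med], 1, which.min)]
print(all(fit$clustering[nearest_medoid] == fit$clustering))
```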

Ignoring a Column/Variable while Clustering with Cluster PAM

One common requirement in clustering analysis is to leave certain columns or variables, such as an ID, out of the distance calculation. The simplest way is to pass only the columns you want to cluster on to the clustering algorithm. If you still want to keep the excluded column for future reference, leave the original dataset untouched, cluster on a subset of its columns, and then add the resulting cluster labels back to the original dataset as a new variable.

Here’s an example of how you could call pam() in R so that a certain column is ignored while performing the clustering:

# Load the necessary libraries.
library(cluster)

# Create a sample dataset (seeded so the example is reproducible).
set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10, mean = 0, sd = 1),
    variableB = rnorm(10, mean = 0, sd = 1)
)

# Cluster on every column except the first one (ID).
pam.result <- pam(data[2:ncol(data)], k = 3, diss = FALSE, metric = "euclidean")

# Extract the cluster label of each row, in the original row order.
cluster.values <- pam.result$clustering

In this code:

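data[2:ncol(data)] refers to all columns in the dataset except for the first one (ID). This column is ignored during clustering.
3 specifies that you want to cluster into three groups.
FALSE is the diss argument: it tells pam() that the input is a table of observations rather than a precomputed dissimilarity matrix.
"euclidean" is the metric used to compute the distances between observations.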
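Positional indexing such as data[2:ncol(data)] silently picks the wrong columns if the column order ever changes. A possible alternative, sketched here with a hypothetical ignore vector that is not part of the cluster package, is to exclude columns by name:

```r
library(cluster)

set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10),
    variableB = rnorm(10)
)

# Hypothetical helper vector: names of the columns to leave out of clustering.
ignore <- c("ID")
features <- data[, setdiff(names(data), ignore), drop = FALSE]

pam.result <- pam(features, k = 3, diss = FALSE, metric = "euclidean")
print(names(features))
```

The original data frame is not modified, so the ID column remains available afterwards.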

Now, let’s modify this code to add a cluster variable to your original dataset. Here’s how:

# Load the necessary libraries.
library(cluster)

# Create a sample dataset (seeded so the example is reproducible).
set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10, mean = 0, sd = 1),
    variableB = rnorm(10, mean = 0, sd = 1)
)

# Cluster on every column except the first one (ID).
pam.result <- pam(data[2:ncol(data)], k = 3, diss = FALSE, metric = "euclidean")

# Extract the cluster label of each row, in the original row order.
cluster.values <- pam.result$clustering

# Add the cluster variable to your data.
data$Cluster <- as.factor(cluster.values)

In this modified code:

The clustering call is the same as in the previous example.
The cluster labels are stored in a new Cluster column and converted to a factor, because they are category labels rather than quantities.
The ID column is still present in data, so every row can be traced back to its original record after clustering.
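One reason to keep the ID column is that it lets you report results in terms of the original records. The sketch below rebuilds the same kind of sample data and then looks up the IDs of the medoids and the size of each cluster. The name medoid.ids is an illustrative assumption; id.med and clustering are components of the object returned by pam().

```r
library(cluster)

set.seed(123)
data <- data.frame(ID = 1:10, variableA = rnorm(10), variableB = rnorm(10))

pam.result <- pam(data[, c("variableA", "variableB")], k = 3)
data$Cluster <- as.factor(pam.result$clustering)

# IDs of the observations that were chosen as medoids.
medoid.ids <- data$ID[pam.result$id.med]
print(medoid.ids)

# Number of rows per cluster.
print(table(data$Cluster))

# Rows of one cluster, with the ID column still attached.
print(data[data$Cluster == "1", ])
```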

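Now, let’s discuss what “silinfo” in R really means and what it is useful for. The silinfo component of a pam() result holds silhouette information: the silhouette width of each data point (widths), the average silhouette width of each cluster (clus.avg.widths), and the overall average (avg.width). The silhouette value is between -1 and 1 and reflects how well each observation fits into its own cluster compared to the nearest neighboring cluster. It is a diagnostic for the clustering, not the source of the cluster labels, which are stored in the clustering component.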

The silhouette coefficient can help you evaluate the quality of your clustering results:

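Silhouette Coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i))

Here a(i) is the average distance between observation i and the other points in its own cluster, and b(i) is the smallest average distance between observation i and the points of any other cluster. Values near 1 indicate a well-placed observation, values near 0 an observation lying between two clusters, and negative values a probable misassignment.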
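As a worked check of the formula, the following sketch computes the silhouette width of every observation by hand and places it next to the values from silhouette() in the cluster package. The data and the helper sil_manual() are illustrative assumptions; the zero returned for single-member clusters follows the convention documented for silhouette().

```r
library(cluster)

set.seed(7)
x <- matrix(rnorm(40), ncol = 2)   # 20 illustrative observations
fit <- pam(x, k = 3)
cl <- fit$clustering
d <- as.matrix(dist(x))            # pairwise Euclidean distances

# Hypothetical helper: silhouette width of observation i.
sil_manual <- function(i) {
    own <- setdiff(which(cl == cl[i]), i)
    if (length(own) == 0) return(0)    # single-member cluster
    a <- mean(d[i, own])               # average distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))
    (b - a) / max(a, b)
}

manual <- sapply(seq_len(nrow(x)), sil_manual)
packaged <- silhouette(cl, dist(x))[, "sil_width"]

print(round(cbind(manual, packaged), 4))
```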

Now let’s add the silinfo to your code:

# Load the necessary libraries.
library(cluster)

# Create a sample dataset (seeded so the example is reproducible).
set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10, mean = 0, sd = 1),
    variableB = rnorm(10, mean = 0, sd = 1)
)

# Cluster on every column except the first one (ID).
pam.result <- pam(data[2:ncol(data)], k = 3, diss = FALSE, metric = "euclidean")

# Get the silhouette information from the PAM result.
silinfo <- pam.result$silinfo

# Per-observation silhouette widths, then the overall average.
print(silinfo$widths)
print(silinfo$avg.width)

In this code:

We read silinfo directly from the object returned by pam(), so no second clustering call is needed.
silinfo$widths lists the cluster, the neighboring cluster, and the silhouette width of each observation, ordered by cluster rather than by original row position.
silinfo$avg.width is the average silhouette width over all observations.
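A common use of the average silhouette width is to compare several candidate numbers of clusters. The sketch below is one possible way to do that; the range 2:5 and the names ks, avg.widths and best.k are arbitrary choices for illustration, and with only ten random rows the outcome carries no real meaning.

```r
library(cluster)

set.seed(123)
data <- data.frame(ID = 1:10, variableA = rnorm(10), variableB = rnorm(10))
features <- data[, c("variableA", "variableB")]

# Candidate numbers of clusters (an arbitrary range for illustration).
ks <- 2:5
avg.widths <- sapply(ks, function(k) pam(features, k = k)$silinfo$avg.width)
names(avg.widths) <- ks

print(round(avg.widths, 3))

# The candidate with the largest average silhouette width.
best.k <- ks[which.max(avg.widths)]
print(best.k)
```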

Now, let’s see an example of how you could use Cluster PAM to summarize a dataset with a single cluster label per row and still keep certain columns available for future reference. Let’s assume that we want to cluster our data into 3 groups using the “euclidean” distance metric on variableA and variableB only, while ID and variableC are preserved in the dataset without taking part in the clustering.

Here’s an example code snippet:

# Load the necessary libraries.
library(cluster)

# Create a sample dataset with several columns and 10 rows.
set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10, mean = 0, sd = 1),
    variableB = rnorm(10, mean = 0, sd = 1),
    variableC = rnorm(10, mean = 0, sd = 1)
)

# Cluster on variableA and variableB only (columns 2 and 3);
# ID and variableC are ignored by the algorithm.
pam.result <- pam(data[, c(2, 3)], k = 3, diss = FALSE, metric = "euclidean")

# Extract the cluster label of each row, in the original row order.
cluster.values <- pam.result$clustering

# Add the cluster variable to your data.
data$Cluster <- as.factor(cluster.values)

print(pam.result)

In this code:

We create a sample dataset with four columns (ID, variableA, variableB and variableC).
We use Cluster PAM to group the rows into 3 clusters using the “euclidean” distance metric. Only variableA and variableB are used by the algorithm, because only columns 2 and 3 are passed to pam().
print(pam.result) shows the medoids, the clustering vector and the value of the objective function, which gives a quick check of the result.
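If this pattern is needed repeatedly, it can be wrapped in a small function. add_pam_cluster() below is a hypothetical helper written for illustration, not part of the cluster package; it clusters on the named columns only and returns the full data frame with a Cluster column appended.

```r
library(cluster)

# Hypothetical helper: cluster on `cols` only, keep every original column.
add_pam_cluster <- function(df, cols, k, metric = "euclidean") {
    fit <- pam(df[, cols, drop = FALSE], k = k, diss = FALSE, metric = metric)
    df$Cluster <- as.factor(fit$clustering)
    df
}

set.seed(123)
data <- data.frame(
    ID = 1:10,
    variableA = rnorm(10),
    variableB = rnorm(10),
    variableC = rnorm(10)
)

clustered <- add_pam_cluster(data, cols = c("variableA", "variableB"), k = 3)
print(names(clustered))
```

Selecting columns by name keeps the call readable and avoids depending on column positions.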

Because the ignored columns stay in the dataset, you can still trace every row back to its ID and its other variables after the rows have been summarized by their cluster labels.

Last modified on 2025-03-21