Recursive Partitioning with Hierarchical Clustering in R for Geospatial Data Analysis

Recursive Partitioning According to a Criterion in R

Introduction

Recursive partitioning is a technique used in data analysis and machine learning to divide a dataset into smaller subsets based on a predefined criterion. In this article, we will explore how to implement recursive partitioning in R using the hclust function from the stats package.

Problem Statement

The problem is to group a dataset by latitude and longitude using hierarchical clustering (hclust), and then to apply the same clustering again inside each cluster produced by the previous step. Splitting stops once every group contains at most two elements, and each group's label should record the chain of parent clusters it came from.

Background

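Hierarchical clustering is an unsupervised learning method that builds a hierarchy of clusters from pairwise distances, either by merging points bottom-up (agglomerative) or by splitting them top-down (divisive). R's hclust function is agglomerative: it takes a dissimilarity object, such as the output of dist, and repeatedly merges the two closest clusters. Note that dist applied to raw latitude and longitude values gives Euclidean distance in degrees, which only approximates distance on the ground.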
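
As a minimal sketch of this step, the snippet below runs dist and hclust on a small set of made-up coordinates. The pts data frame and its values are illustrative assumptions, not the dataset used later in this article.

```r
# Hypothetical coordinates, for illustration only
pts <- data.frame(
  latitude  = c(10.0, 10.1, 10.2, 20.0, 20.1),
  longitude = c(-5.0, -5.1, -5.0, -8.0, -8.1)
)

# Pairwise Euclidean distances in degrees (not true ground distances)
d <- dist(pts)

# Agglomerative clustering; "complete" linkage is the hclust default
tree <- hclust(d)

# One merge per row of tree$merge: n points always produce n - 1 merges
n_merges <- nrow(tree$merge)

# Heights at which the successive merges happen
merge_heights <- tree$height
```

The returned object stores the whole merge history, so the same tree can later be cut at different levels without clustering again.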

The cutree function extracts cluster labels from the tree returned by hclust. It cuts the tree either into a requested number of groups (k) or at a given height (h), and one of the two must be supplied.
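
The two ways of cutting a tree can be sketched as follows. The coordinates are again made up for illustration, and the threshold h = 1 is an arbitrary choice that happens to separate the two hypothetical groups.

```r
# Hypothetical coordinates forming two well-separated groups
pts <- data.frame(
  latitude  = c(10.0, 10.1, 10.2, 20.0, 20.1),
  longitude = c(-5.0, -5.1, -5.0, -8.0, -8.1)
)
tree <- hclust(dist(pts))

# Ask for a fixed number of groups
by_count <- cutree(tree, k = 2)

# Or cut at a height: points stay together only if they merged at or below h
by_height <- cutree(tree, h = 1)

# Both calls return one integer label per input row
group_sizes <- table(by_count)
```

Cutting by k is the convenient choice for recursive partitioning, because the number of children per level is then fixed in advance.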

Solution Overview

To solve this problem, we need a recursive procedure that re-clusters each cluster produced by the previous step and stops once a group is small enough. We will use a custom R function called repartition to achieve this.

The main components of our solution are:

  • The multi_cut function, which recursively applies the clustering process to each cluster.
  • The repartition function, which ties everything together and returns the final clustered dataset.

Code Implementation

# Define a function for hierarchical clustering (HCLUST)
cluster_labels <- function(data, k) {
  # Cluster on the coordinate columns and cut the tree into k groups.
  # stats::hclust is called explicitly so that no other definition can mask it.
  coords <- data[c("latitude", "longitude")]
  cutree(stats::hclust(dist(coords)), k = k)
}

# Define a function for recursive partitioning
multi_cut <- function(data, n = 3, max_size = 2, prefix = NULL) {
  # Base case: the group is small enough, so label it and stop splitting
  if (nrow(data) <= max_size) {
    data$cluster <- if (is.null(prefix)) "1" else prefix
    return(data)
  }

  # Never request more groups than there are rows
  groups <- cluster_labels(data, k = min(n, nrow(data)))

  # Split the data into clusters based on the new labels
  clusters <- split(data, groups)

  # Recursively apply the clustering process to each cluster,
  # extending the label with the child's number (e.g. "2" becomes "2.1")
  result <- Map(function(part, id) {
    multi_cut(part, n, max_size, paste(c(prefix, id), collapse = "."))
  }, clusters, names(clusters))

  # Combine the results from all recursive calls
  do.call(rbind, unname(result))
}

# Define a function that validates the input and runs the recursion
repartition <- function(data, n = 3, max_size = 2) {
  stopifnot(
    is.data.frame(data), nrow(data) >= 1,
    all(c("latitude", "longitude") %in% names(data)),
    n >= 2, max_size >= 1
  )

  # Remember the original row order, since the recursion regroups rows
  data$row_id <- seq_len(nrow(data))
  result <- multi_cut(data, n, max_size)

  # Restore the original order and drop the helper column
  result <- result[order(result$row_id), ]
  result$row_id <- NULL
  rownames(result) <- NULL

  return(result)
}

Example Usage

# Load the necessary libraries
library(ggplot2)

# Create a sample dataset with latitude and longitude values
data <- data.frame(
  latitude  = c(23.13659, 23.49100, 23.49138, 23.49053, 23.45525,
                23.44633, 23.44412, 23.44085, 23.43415, 23.43927),
  longitude = c(-11.64711, -11.67840, -11.68223, -11.68326, -11.94486,
                -11.94500, -11.93693, -11.93217, -11.92761, -11.86433)
)

# Apply the recursive partitioning function to the dataset
result <- repartition(data)

# Print the resulting clustered data
print(result)

# Plot the points with ggplot2, colored by cluster label
ggplot(result, aes(longitude, latitude, color = factor(cluster))) +
  geom_point()

Conclusion

In this article, we explored how to implement recursive partitioning in R using the hclust and cutree functions from the stats package. The multi_cut function re-clusters each cluster until every group has at most two elements, and the repartition function wraps it to return the final clustered dataset.


Last modified on 2023-07-03