Adding Text Annotation to a Clustering Scatter Plot (tSNE)
Introduction
The tSNE (t-Distributed Stochastic Neighbor Embedding) algorithm is a popular dimensionality reduction technique for projecting high-dimensional data into two or three dimensions, where it can be plotted and inspected for clusters. One of the key challenges in visualizing high-dimensional data using tSNE is effectively communicating the underlying structure of the data. Adding text annotations to a clustering scatter plot makes it easier to see which cluster is which without relying on a separate legend.
In this article, we will explore how to add text annotations to a clustering scatter plot created with tSNE using R and the ggrepel package.
Background
The ggplot2 package provides an elegant and consistent syntax for creating complex graphics. However, when working with large datasets or plots that require multiple elements, it can be challenging to position labels in a way that minimizes overlap with other elements on the plot.
This is where the ggrepel package comes into play. The ggrepel package extends ggplot2 with geoms for text label placement. One of its most useful functions is geom_label_repel, which adds boxed labels to a plot and "repels" them away from each other, from the data points, and from the plot edges so that they overlap as little as possible.
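To make the difference concrete, here is a minimal sketch that labels every point of the built-in mtcars dataset twice: once with ggplot2's own geom_label and once with geom_label_repel. The dataset and the object names (cars, p_overlap, p_repel) are assumptions chosen for illustration and are not part of the clustering example in this article.

```r
library(ggplot2)
library(ggrepel)

# Illustration data: built-in mtcars, with the row names used as label text
cars <- mtcars
cars$name <- rownames(mtcars)

# Plain ggplot2 labels are drawn exactly at each point, so they pile up
p_overlap <- ggplot(cars, aes(x = wt, y = mpg, label = name)) +
  geom_point() +
  geom_label()

# ggrepel moves the labels apart; seed makes the placement reproducible
p_repel <- ggplot(cars, aes(x = wt, y = mpg, label = name)) +
  geom_point() +
  geom_label_repel(seed = 1)

print(p_repel)
```

With this many labels, ggrepel may leave some points unlabeled and warn about it; raising the max.overlaps argument is the usual remedy.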
Prerequisites
- R
- ggplot2 package
- ggrepel package
- dplyr package (for summarization and grouping)
- data frame with XY coordinates and cluster assignments (df)
- additional data frame for label text (label.df)
Step 1: Load Necessary Libraries
library(dplyr)
library(ggplot2)
library(ggrepel)
Step 2: Create Sample Data
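Create a sample dataset with XY coordinates and cluster assignments. The simulated points stand in for the two-dimensional output of a tSNE run: five Gaussian clusters of 50 points each.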
set.seed(1)
df <- do.call(rbind, lapply(seq(1, 20, 4), function(i) {
  data.frame(x = rnorm(50, mean = i, sd = 1), y = rnorm(50, mean = i, sd = 1), cluster = i)
}))
df$cluster <- factor(df$cluster)
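With real data, a data frame of the same shape could come from an actual tSNE run. The following is only a sketch under stated assumptions: it uses the Rtsne package (not mentioned elsewhere in this article), the built-in iris measurements as a stand-in for a high-dimensional dataset, and k-means on the embedding as one possible source of cluster assignments. The names feats, tsne, km, and df_tsne are hypothetical.

```r
library(Rtsne)

set.seed(1)

# Assumption: iris stands in for a real high-dimensional dataset.
# Duplicate rows are dropped because Rtsne rejects them by default.
feats <- unique(as.matrix(iris[, 1:4]))

# Two-dimensional embedding; the coordinates are returned in tsne$Y
tsne <- Rtsne(feats, dims = 2, perplexity = 30)

# Assumption: k-means on the embedding supplies the cluster assignments
km <- kmeans(tsne$Y, centers = 3)

# Same columns as the simulated df: x, y, and a factor cluster
df_tsne <- data.frame(
  x = tsne$Y[, 1],
  y = tsne$Y[, 2],
  cluster = factor(km$cluster)
)
```

Because df_tsne has the same x, y, and cluster columns as the simulated df, the remaining steps should apply to it unchanged.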
Step 3: Create Additional Data Frame for Labels
Create an additional data frame with cluster labels.
label.df <- data.frame(cluster = levels(df$cluster), label = paste0("cluster: ", levels(df$cluster)))
Step 4: Summarize XY Coordinates and Merge with Label Data
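Summarize the XY coordinates for each cluster by calculating the minimum x-coordinate and maximum y-coordinate, which anchors each label near the upper-left corner of its cluster, then merge the result with the label text.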
label.df_2 <- df %>%
  group_by(cluster) %>%
  summarize(x = min(x), y = max(y)) %>%
  # name the join key explicitly instead of relying on dplyr's guess
  left_join(label.df, by = "cluster")
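The choice of min(x) and max(y) as the anchor is a design decision rather than a requirement. As a hedged alternative, the sketch below anchors each label at the cluster's median position instead, so the label starts in the middle of the cluster and geom_label_repel pushes it outward. The name label.df_median is hypothetical, and the sample data is recreated so that the block runs on its own.

```r
library(dplyr)

# Recreate the sample data from Step 2
set.seed(1)
df <- do.call(rbind, lapply(seq(1, 20, 4), function(i) {
  data.frame(x = rnorm(50, mean = i, sd = 1),
             y = rnorm(50, mean = i, sd = 1),
             cluster = i)
}))
df$cluster <- factor(df$cluster)

# Label text from Step 3
label.df <- data.frame(cluster = levels(df$cluster),
                       label = paste0("cluster: ", levels(df$cluster)))

# Hypothetical alternative anchor: the per-cluster median of x and y
label.df_median <- df %>%
  mutate(cluster = as.character(cluster)) %>%  # match the character key in label.df
  group_by(cluster) %>%
  summarize(x = median(x), y = median(y)) %>%
  left_join(label.df, by = "cluster")
```

The median is used here rather than the mean because it is less affected by stray points, which tSNE embeddings sometimes contain.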
Step 5: Create the Plot with Text Annotations
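Create a scatter plot of the points colored by cluster and add the text annotations using geom_label_repel. The label layer inherits the x, y, and color mappings from the ggplot() call, so each label is drawn in the color of its cluster.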
ggplot(df, aes(x = x, y = y, color = cluster)) +
  geom_point() +
  theme_minimal() + theme(legend.position = "none") +
  ggrepel::geom_label_repel(data = label.df_2, aes(label = label))
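If the default placement is not satisfactory, geom_label_repel accepts a few tuning arguments. The sketch below rebuilds the plot end to end with three of them (box.padding, min.segment.length, and seed); the specific values are assumptions picked for illustration, not recommendations from the original steps, and the label text is built inline with mutate instead of a separate join to keep the block short.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Recreate the sample data from Step 2
set.seed(1)
df <- do.call(rbind, lapply(seq(1, 20, 4), function(i) {
  data.frame(x = rnorm(50, mean = i, sd = 1),
             y = rnorm(50, mean = i, sd = 1),
             cluster = i)
}))
df$cluster <- factor(df$cluster)

# One label per cluster, anchored at min(x) / max(y) as in Step 4
label.df_2 <- df %>%
  group_by(cluster) %>%
  summarize(x = min(x), y = max(y)) %>%
  mutate(label = paste0("cluster: ", cluster))

p <- ggplot(df, aes(x = x, y = y, color = cluster)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_label_repel(
    data = label.df_2,
    aes(label = label),
    box.padding = 0.5,       # extra space kept around each label box
    min.segment.length = 0,  # always draw the line from label to anchor
    seed = 42                # reproducible label placement
  )

print(p)
```

Setting seed matters because ggrepel's placement is iterative and partly random, so without it the labels can land in slightly different positions on each render.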
Conclusion
By following the steps outlined in this article, you can add text annotations to a clustering scatter plot created with tSNE using R and the ggrepel package: build a data frame with one label per cluster, compute an anchor position for each label, and pass it to geom_label_repel as a separate layer.
The ggrepel package does the positioning work, moving labels away from each other and from the data points so that overlap is kept to a minimum. This makes it easier to create informative and visually appealing graphics that communicate the underlying structure of high-dimensional data.
Last modified on 2023-07-10