Performing a cluster analysis in R

The code below (thanks to Emily Ruzich) uses R to perform several commonly used procedures for identifying clusters, including k-means and hierarchical approaches. It also assesses the solutions using a scree plot based on the within-cluster variance, and using a measure of the between- to within-cluster variance (the average silhouette width) as suggested by Kaufman and Rousseeuw (1990). Plots can also be saved to pdf files from within R. Zakharov (2016) provides a pdf primer on using k-means clustering in psychology.

aqdat <- read.csv("U://My Documents//aq 4 point data_reduced.csv")
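# optional sanity check (not in the original code): the clustering below assumes every column is numeric and complete
str(aqdat)        # check the structure and variable types
sum(is.na(aqdat)) # kmeans stops with an error and dist produces NA distances if missing values are present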

# the deparse and sink commands below are not needed and can be ignored unless wrapping all of the commands here in a single function
# the deparse function tells R to treat its input as a string regardless of whether it is a function or a data set,
# which is useful for building strings (e.g. file names) later in a function

#filename <- deparse(substitute(aqdat)) 
#sink(paste("sib_output_cluster_",filename,".txt",sep=""), split=F, append=T, type="output")

# Clustering ------------------------------------------------------------
## http://www.statmethods.net/advstats/cluster.html
## See also Grayson ppt


# compute the total sum of squares over all variables = (number of rows - 1) x sum of the variances of each column = (N-1)(sum of variable variances)

wss <- (nrow(aqdat)-1)*sum(apply(aqdat,2,var))

# as a check, the first element of wss above (the one-cluster solution, i.e. i=1) corresponds to the sum of the variable variances
# work out the total within-cluster sum of squares for the 2- to 10-cluster solutions
for (i in 2:10) wss[i] <- sum(kmeans(aqdat,centers=i)$withinss)
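# sanity check (not in the original code): wss[1] computed from the variances should equal
# the within-cluster sum of squares of a one-cluster k-means solution
all.equal(wss[1], sum(kmeans(aqdat, centers=1)$withinss))

# kmeans uses random starting centres, so results can vary between runs; an illustrative way to make
# the scree plot reproducible and more stable (the seed and nstart values here are arbitrary choices)
set.seed(123)
for (i in 2:10) wss[i] <- sum(kmeans(aqdat, centers=i, nstart=25)$withinss)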

# scree plot comparing the total within-cluster sum of squares with the number of clusters; as with factor analysis, choose the
# number of clusters where there is a kink (elbow), as this shows the cluster solutions are stabilising
plot(1:10, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
#can save the plot in a pdf file
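# for example, the scree plot can be written to a pdf file by wrapping the plot call in pdf() and dev.off()
# (the file name here is illustrative, not from the original code)
pdf("wss_scree_plot.pdf") # open a pdf graphics device
plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
dev.off() # close the device, which writes the file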

# alternative method - prints the suggested number of clusters based on the optimum average silhouette width
# the silhouette width measures how much more similar a data point is to
# the points in its own cluster than to points in the neighbouring cluster; the average
# silhouette width is the average of the silhouette widths over all the
# points in the data - the higher the better - see L. Kaufman and P. Rousseeuw,
# Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. 1990.

# see also Jess Shen's thesis pdf file in the folder (Emily Ruzich clustering in H ring folder/cluster analyses, taken from
# http://www.eecis.udel.edu/~compbio/thesis/JessShenThesis.pdf)

install.packages("fpc") # only needed the first time, if fpc is not already installed
library(fpc)
aa <- pamk(aqdat,krange=2:10); print("OUTPUT FOR PAMK FUNCTION"); print(aa)
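# the key pamk results can also be pulled out individually (see ?pamk for the full return value)
print(aa$nc)                          # suggested number of clusters (optimum average silhouette width)
print(aa$pamobject$silinfo$avg.width) # average silhouette width of the suggested solution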

### 2 clusters ----------------------------------------------------------
  
### K-Means Cluster Analysis
fit_kmeans_2 <- kmeans(aqdat, 2)

# get cluster descriptive data
km2_means <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=mean); print("KM2 MEANS"); print(km2_means)
km2_sds <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=sd); print("KM2 SDS"); print(km2_sds)
km2_ranges <- aggregate(aqdat,by=list(fit_kmeans_2$cluster),FUN=range); print("KM2 RANGES"); print(km2_ranges)

# append kmeans cluster assignment
aqdat2   <- data.frame(aqdat, fit_kmeans_2$cluster)
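# a quick check on the solution (not in the original code): how the observations split between the 2 clusters
table(fit_kmeans_2$cluster)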

# repeat for 3 cluster solution
fit_kmeans_3 <- kmeans(aqdat, 3)
  
# get cluster descriptive data
km3_means <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=mean); print("KM3 MEANS"); print(km3_means)
km3_sds <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=sd); print("KM3 SDS"); print(km3_sds)
km3_ranges <- aggregate(aqdat,by=list(fit_kmeans_3$cluster),FUN=range); print("KM3 RANGES"); print(km3_ranges)
  
# append kmeans cluster assignment (mydata, with the 3-cluster assignments, is written to a csv file at the end)
mydata <- data.frame(aqdat, fit_kmeans_3$cluster)


### Ward Hierarchical Clustering
# dist, hclust, cutree and rect.hclust are all in base R (the stats package), so no extra package is needed here
d <- dist(aqdat, method = "euclidean") # distance matrix
  
fit_ward_2 <- hclust(d, method="ward.D2") # method="ward" is deprecated in current versions of R; "ward.D2" implements Ward's criterion
plot(fit_ward_2) # display dendrogram
clustnumber2 <- cutree(fit_ward_2, k=2) # cut tree into 2 clusters
rect.hclust(fit_ward_2, k=2, border="red") # draw dendrogram with red borders around the 2 clusters
aqdat2 <- data.frame(aqdat, clustnumber2)
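# with both a k-means and a Ward 2-cluster solution available, a cross-tabulation (an added check,
# not in the original code) shows how far the two methods agree on cluster membership;
# note the cluster numbering is arbitrary, so agreement may show up on the off-diagonal
table(kmeans = fit_kmeans_2$cluster, ward = clustnumber2)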

fit_ward_3 <- hclust(d, method="ward.D2") # the tree is identical to fit_ward_2; only the cut below differs
  
 #title <- paste("sibplots_cluster_",filename,"_warddendrogram",sep="")
#pdf(paste(title,".pdf",sep=""))
plot(fit_ward_3) # display dendrogram
clustnumber3 <- cutree(fit_ward_3, k=3) # cut tree into 3 clusters
rect.hclust(fit_ward_3, k=3, border="red") # draw dendrogram with red borders around the 3 clusters
#dev.off() # uncomment together with the pdf() lines above; calling dev.off() with no open device gives an error

library(cluster)
par(mfrow=c(1,2)) 

# compare the cluster solutions from k-means
# displays the 2 and 3 cluster solutions side by side on one plot
clusplot(aqdat, fit_kmeans_2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
clusplot(aqdat, fit_kmeans_3$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
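# fpc also offers plotcluster as an alternative view (an added example, not in the original code);
# it uses a discriminant projection rather than the principal components projection used by clusplot
plotcluster(aqdat, fit_kmeans_2$cluster)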

#sink()
write.csv(mydata, file="U://My Documents//aqdat.with.cluster.assignments.csv") # mydata holds the data plus the 3-cluster k-means assignments

In addition, silhouette plots can be used (e.g. via the pam function in the cluster package) to help choose an optimal number of clusters, as in the sketch below. Cluster solutions whose clusters have higher silhouette widths (the nearer to 1 the better) are preferred, since higher silhouette widths represent more distinctive clusters.
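A minimal sketch using the pam and silhouette functions from the cluster package (the choice of k=2 here is purely illustrative):

library(cluster)
fit_pam_2 <- pam(aqdat, k=2)       # partitioning around medoids with 2 clusters
plot(silhouette(fit_pam_2))        # silhouette plot; widths near 1 indicate well-separated, distinctive clusters
print(fit_pam_2$silinfo$avg.width) # average silhouette width for this solution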

References

Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.

Smith, L. (2018) Cooking up statistics: the science and the art. Significance, 15(5), 28-32. An award-winning article featuring uses of agglomerative clustering, dendrograms, and Euclidean and Manhattan distances.

Zakharov, K. (2016) Application of k-means clustering in psychological studies. The Quantitative Methods for Psychology, 12(2), 87-100. doi:10.20982/tqmp.12.2.p087

More on clustering methods is given in the accompanying pdf.