4. Clustering

For the actual clustering of the time-series features, the main aim is to organize participants into groups so that the features of participants within a group are as similar as possible, while the features of participants in different groups are as different as possible (Liao, 2005). There are many different ways in which (time-series) features can be clustered into homogeneous and well-separated groups. We provide a more in-depth discussion of the different options in the main manuscript and recommend the excellent overview by Xu & Tian (2015) for a more general description of the available approaches. Broadly speaking, the approaches that are readily available and suitable for most time-series feature data can be categorized as based on (1) centroids, (2) distributions, (3) density, (4) hierarchies, or (5) a combination thereof (Jain et al., 1999). There is, unfortunately, no one-size-fits-all solution to clustering, and users will usually have to make an informed decision based on the structure of their data as well as an appropriate weighing of accuracy and efficiency. For our own illustration, we have chosen centroid-based k-means clustering.

K-means Clustering

While k-means may not always deliver the highest accuracy, it offers distinct practical benefits. We selected k-means primarily for its efficiency: it handles large numbers of participants and features without imposing many restrictive assumptions on cluster shapes (Jain, 2010). The algorithm is well established in the research community and implemented in numerous statistical software packages (Hand & Krzanowski, 2005). Moreover, a significant number of feature selection approaches have been tailored explicitly to the k-means algorithm (Boutsidis et al., 2010). K-means therefore serves as an excellent foundation for much psychological research and is applicable to a diverse array of projects.

In practice, we entered the participants’ PC-scores from the feature reduction step into the k-means algorithm. Because we did not know the underlying number of clusters in our sample, we calculated the cluster solutions for \(k=\{2, \dots , 10\}\). To avoid local minima, we used 100 random initial centroid positions for each run (nstart = 100). Here, we use a for loop to arrive at these cluster solutions with the kmeans() function from the stats package (R Core Team, 2023) and save the results in the kmeans_results list.

# Fit k-means solutions for k = 2, ..., 10, each with 100 random starts
kmeans_results <- list()
for (i in 2:10) {
  kmeans_results[[i - 1]] <- kmeans(pca_scores, centers = i, nstart = 100)
}
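
Because of the i - 1 indexing in the loop, the solution with k clusters is stored at position k - 1 of kmeans_results. As a brief illustrative check (using only the objects created above; the object name assignments_k3 is ours for illustration), the cluster assignments of, for example, the three-cluster solution can be retrieved from the cluster component returned by kmeans():

# Cluster assignments of the k = 3 solution (stored at list position 2)
assignments_k3 <- kmeans_results[[2]]$cluster
table(assignments_k3)  # number of participants per cluster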

Each of the nine cluster solutions converged within the iteration limit. In the next step, we will evaluate which of the extracted cluster solutions offers the best fit to the data.
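
Convergence can be verified directly from the output of kmeans(): the iter component holds the number of iterations, ifault equals 0 when no algorithmic problems were flagged, and tot.withinss gives a first, rough impression of fit. The following minimal sketch (the object name fit_overview is ours for illustration) collects these quantities across the nine solutions as a quick sanity check before the formal evaluation.

# Overview of iterations, convergence, and total within-cluster sum of squares
fit_overview <- data.frame(
  k            = 2:10,
  iterations   = sapply(kmeans_results, function(res) res$iter),
  converged    = sapply(kmeans_results, function(res) res$ifault == 0),
  tot_withinss = sapply(kmeans_results, function(res) res$tot.withinss)
)
fit_overview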

References

Boutsidis, C., Drineas, P., & Mahoney, M. W. (2010). Unsupervised feature selection for the k-means clustering problem. NIPS’09: Proceedings of the 22nd International Conference on Neural Information Processing Systems, 153–161.
Hand, D. J., & Krzanowski, W. J. (2005). Optimising k-means clustering results with standard software packages. Computational Statistics & Data Analysis, 49(4), 969–973. https://doi.org/10.1016/j.csda.2004.06.017
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666. https://doi.org/10.1016/j.patrec.2009.09.011
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323. https://doi.org/10.1145/331499.331504
Liao, T. W. (2005). Clustering of time series data—a survey. Pattern Recognition, 38(11), 1857–1874. https://doi.org/10.1016/j.patcog.2005.01.025
R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2), 165–193. https://doi.org/10.1007/s40745-015-0040-1