I use the following tsclust call to cluster my data:
library(dtwclust)

SURFSKINTEMP_CLUST <- tsclust(SURFSKINTEMP, k = 10L:20L,
                              distance = "dtw_basic", centroid = "dba",
                              trace = TRUE, seed = 938,
                              norm = "L2", window.size = 2L,
                              args = tsclust_args(cent = list(trace = TRUE)))
SURFSKINTEMP is very big:
str(SURFSKINTEMP)
List of 327239
$ V1 : num [1:7] 0.13 0.631 -0.178 0.731 0.86 ...
$ V2 : num [1:6] 0.117 -0.693 -0.911 -0.911 -0.781 ...
$ V3 : num [1:7] 0.117 -0.693 -0.911 -0.911 -0.781 ...
$ V4 : num [1:6] -0.693 -0.911 -0.911 -0.781 -0.604 ...
Then I want to use cvi to evaluate the optimal number of clusters k:
names(SURFSKINTEMP_CLUST) <- paste0("k_",10L:20L)
sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")
But I get an error:
> sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")
Error: cannot allocate vector of size 797.8 Gb
How can I evaluate the optimal number of clusters k in my case?
Specifying type = "internal" will try to calculate 7 indices: Silhouette, Dunn, COP, DB, DB*, CH and SF. As mentioned in the documentation for cvi, the first 3 will try to calculate the whole cross-distance matrix, which in your case would be a 327,239 x 327,239 matrix; you're going to have a hard time finding a computer that can allocate that, and it would take a long time to compute.

Since you're using DBA for centroids, you could see if DB or DB* make sense for your application, as shown below.
You could also look at the somewhat simplistic elbow method, bearing in mind that you can calculate the sum of squared error (SSE) from the fitted objects (see the documentation for TSClusters-class). A minimal sketch, assuming the cldist slot holds each series' distance to its assigned centroid:
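# SSE for each k, assuming each element of SURFSKINTEMP_CLUST is a
# TSClusters object whose cldist slot holds each series' distance to
# its assigned centroid.
sse <- sapply(SURFSKINTEMP_CLUST, function(cl) sum(cl@cldist ^ 2))

# Plot SSE against k and look for the "elbow" where the decrease flattens.
plot(10L:20L, sse, type = "b", xlab = "k", ylab = "SSE")

The k after which the SSE curve stops dropping sharply is a reasonable candidate, although the elbow is not always unambiguous.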