R: PCA and cluster analysis are very slow to compute

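My data has 30,000 rows and 140 columns, and I am trying to cluster it. I run a principal component analysis and then use about 12 principal components (PCs) for the cluster analysis. I took a random sample of 3,000 observations and ran it; the PCA plus hierarchical clustering took 44 minutes.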

A colleague did the same thing in SPSS and it took significantly less time. Any idea why?

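Here is a simplified version of my code. It works fine but becomes very slow above 2,000 observations. I have included the USArrests dataset, which is very small, so it does not really represent my problem, but it shows what I am trying to do. I am hesitant to post a large dataset because that seems rude.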

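I don't know how to speed up the clustering. I know I could randomly sample the data and then use a predict function to assign clusters to the remaining test data. Ideally, though, I would like to use all of the data in the clustering, because the data is static and will never change or be updated.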

library(factoextra)
library(FactoMineR)       
library(FactoInvestigate) 
## Data
# mydata = My data has 32,000 rows with 139 variables.
# example data with small data set 
data("USArrests")
mydata <- USArrests
## Initial PCA on mydata
res.pca <- PCA(mydata, ncp=4, scale.unit=TRUE, graph = TRUE)
Investigate(res.pca)  # this report is very helpful!  I determined to keep 12 PC and start with 3 clusters.
## Keep PCA dataset with only 2 PC
res.pca1 <- PCA(mydata, ncp=2, scale.unit=TRUE, graph = TRUE)
## Run a HC on the PC:  Start with suggested number of PC and Clusters 
res.hcpc <- HCPC(res.pca1, nb.clust=4, graph = FALSE)
## Dendrogram
fviz_dend(res.hcpc,
cex = 0.7, 
palette = "jco",
rect = TRUE, rect_fill = TRUE, 
rect_border = "jco", 
labels_track_height = 0.8 
)
## Cluster Viz
fviz_cluster(res.hcpc,
geom = "point",  
ellipse.type = "convex", 
#repel = TRUE, 
show.clust.cent = TRUE, 
palette = "jco", 
ggtheme = theme_minimal(),
main = "Factor map"
)

#### Cluster 1: Means of Variables
res.hcpc$desc.var$quanti$'1'
#### Cluster 2: Means of Variables
res.hcpc$desc.var$quanti$'2'
#### Cluster 3: Means of Variables
res.hcpc$desc.var$quanti$'3'
#### Cluster 4: Means of Variables
res.hcpc$desc.var$quanti$'4'
#### Number of Observations in each cluster
cluster_hd = res.hcpc$data.clust$clust
summary(cluster_hd)

Any idea why SPSS is so much faster?

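Any idea how to speed this up? I know clustering is computationally intensive, but I am not sure where the efficiency threshold lies, or whether my data, with 30,000 records and 140 variables, is beyond it.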

Are some other clustering packages more efficient? Suggestions?

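HCPC is a hierarchical clustering on principal components using Ward's criterion. You can use the k-means algorithm for the clustering part instead, which is much faster: hierarchical clustering has a time complexity of O(n³), whereas k-means is O(n) for a fixed number of clusters and iterations, where n is the number of observations.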
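As a rough illustration of swapping the clustering step for k-means, here is a minimal sketch using only base R (`prcomp`, `kmeans`) on the USArrests data from the question. The choice of 2 components and 4 clusters simply mirrors the question's example code, and `nstart = 25` is an arbitrary assumption, not a recommendation.

```r
# Sketch: k-means on the first principal components, base R only.
set.seed(42)                                    # k-means starts from random centers

# PCA on standardized variables (comparable to scale.unit = TRUE in the question)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                          # keep 2 PCs, as in the question's example

# k-means with 4 clusters; nstart = 25 restarts is an assumed, arbitrary value
km <- kmeans(scores, centers = 4, nstart = 25)

# Number of observations in each cluster
table(km$cluster)

# Mean of each original variable per cluster
aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)
```

On the full data the same calls would be applied to the 12 retained components instead of 2.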

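Since k-means and Ward's hierarchical clustering optimize the same criterion (minimizing the total within-cluster variance), if you need to keep the hierarchy you can first run k-means with a large number of clusters (e.g. 300) and then run the hierarchical clustering on the cluster centers.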
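The two-stage idea could be sketched as follows in base R. The data here is synthetic (random numbers standing in for 12 PC scores), and the names `k_pre`, `center_cluster`, and `final_cluster` are made up for illustration. Note one simplification: the hierarchical step treats every k-means center as a single point and ignores how many observations each center represents.

```r
# Sketch: k-means pre-clustering followed by Ward clustering of the centers.
set.seed(1)

# Synthetic stand-in for the real data: 3,000 observations x 12 PC scores (assumption)
n <- 3000
scores <- matrix(rnorm(n * 12), nrow = n, ncol = 12)

# Stage 1: k-means with a large number of small clusters (300, as suggested above)
k_pre <- 300
km <- kmeans(scores, centers = k_pre, iter.max = 50)

# Stage 2: Ward hierarchical clustering on the 300 centers only.
# Simplification: each center counts as one point, cluster sizes are ignored.
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into 3 final clusters and map every observation
# to the final cluster of its k-means center
center_cluster <- cutree(hc, k = 3)
final_cluster <- unname(center_cluster[km$cluster])

table(final_cluster)
```

The distance matrix now has 300 rows instead of n, which is where the time saving comes from. FactoMineR's `HCPC` also appears to offer a `kk` argument for a k-means preprocessing step of this kind; check the package documentation before relying on it.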
