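I am trying to build a clustering model with k-means using both continuous and categorical variables.
The goal is to create clusters based on gender, age, occupation, billing plan, phone, and usage of certain applications.
I am looking into how to handle the categorical data. I know I should turn them into dummy variables, but I am not sure how to process all the categorical variables at once.
Thanks
The table looks like: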
ID  Gender  Age  Occupation  Plan  Phone  Amazon Prime GB DL  Apple Music GB DL  Audio DB DL
C001  NR  56  Samsung  Student Union  0  0  0.498829165
C002  M  25  Management  Malawi  Huawei  0  0  1
C003  H  32  Archaius  Apple  Professor  0  0.632000584  1
One possible way to create dummy variables from categorical variables is the "fastDummies" package:
library(fastDummies)
df <- data.frame(NR1 = c(1, 2, 3),
                 NR2 = c(0.1, 0.5, 0.7),
                 FA1 = factor(c("A", "B", "C")),
                 FA2 = factor(c("5", "6", "7")))
str(df)
#> 'data.frame': 3 obs. of 4 variables:
#>  $ NR1: num 1 2 3
#>  $ NR2: num 0.1 0.5 0.7
#>  $ FA1: Factor w/ 3 levels "A","B","C": 1 2 3
#>  $ FA2: Factor w/ 3 levels "5","6","7": 1 2 3
# one variable per factor level
fastDummies::dummy_cols(df)
#>   NR1 NR2 FA1 FA2 FA1_A FA1_B FA1_C FA2_5 FA2_6 FA2_7
#> 1   1 0.1   A   5     1     0     0     1     0     0
#> 2   2 0.5   B   6     0     1     0     0     1     0
#> 3   3 0.7   C   7     0     0     1     0     0     1
# encoding with n-1 columns per factor: the first level is dropped, since all zeros already implies it
fastDummies::dummy_cols(df, remove_first_dummy = TRUE)
#>   NR1 NR2 FA1 FA2 FA1_B FA1_C FA2_6 FA2_7
#> 1   1 0.1   A   5     0     0     0     0
#> 2   2 0.5   B   6     1     0     1     0
#> 3   3 0.7   C   7     0     1     0     1
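
To connect the encoding back to the k-means goal in the question, here is a minimal sketch of the remaining steps: encode every factor at once, scale the columns, then cluster. It makes several assumptions. The toy data is made up, the choice of 2 clusters is arbitrary, and base R's model.matrix() stands in for fastDummies only so that the snippet runs without extra packages. Whether the dummy columns should be scaled along with the numeric ones is a judgment call, not a rule.

```r
# Toy data shaped like the example above (values are invented for illustration).
df <- data.frame(NR1 = c(1, 2, 3, 4, 5, 6),
                 NR2 = c(0.1, 0.5, 0.7, 0.2, 0.9, 0.4),
                 FA1 = factor(c("A", "B", "C", "A", "B", "C")),
                 FA2 = factor(c("5", "6", "7", "5", "6", "7")))

# Identify the factor columns so all of them are encoded in one call.
is_fac <- sapply(df, is.factor)

# contrasts = FALSE requests one indicator column per level for each factor.
full_contrasts <- lapply(df[, is_fac, drop = FALSE], contrasts, contrasts = FALSE)

# Build the numeric matrix, then drop the "(Intercept)" column.
x <- model.matrix(~ ., data = df, contrasts.arg = full_contrasts)[, -1]

# Put all columns on a comparable scale before computing Euclidean distances.
x_scaled <- scale(x)

# k-means with an assumed k = 2; nstart reruns from several random starts.
set.seed(1)
km <- kmeans(x_scaled, centers = 2, nstart = 10)

# Cluster label for each row of df.
km$cluster
```

With fastDummies, the matrix passed to scale() would instead come from the dummy_cols() output after dropping the original factor columns.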