Good afternoon,
Suppose we have the following dataset from UCI:
ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE",
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-19L))
# output :
YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH ADULT TRUE
2 YELLOW SMALL STRETCH CHILD FALSE
3 YELLOW SMALL DIP ADULT FALSE
4 YELLOW SMALL DIP CHILD FALSE
5 YELLOW LARGE STRETCH ADULT TRUE
6 YELLOW LARGE STRETCH ADULT TRUE
7 YELLOW LARGE STRETCH CHILD FALSE
8 YELLOW LARGE DIP ADULT FALSE
9 YELLOW LARGE DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT TRUE
11 PURPLE SMALL STRETCH ADULT TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL DIP ADULT FALSE
14 PURPLE SMALL DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT TRUE
16 PURPLE LARGE STRETCH ADULT TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE DIP ADULT FALSE
19 PURPLE LARGE DIP CHILD FALSE
Suppose I have also applied a clustering algorithm and obtained the following result:
clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
index class
1 1 1
2 2 2
3 3 3
4 4 3
5 5 3
6 6 2
7 7 3
8 8 1
9 9 2
10 10 3
11 11 3
12 12 2
13 13 2
14 14 3
15 15 2
16 16 2
17 17 1
18 18 1
19 19 2
Here, the index variable identifies a row of ballons, and class is the cluster that row was assigned to.
I know we can compute the frequencies of all the categorical variables with:
> sapply(ballons,table)
y1 y2 y3 y4 y5
PURPLE 10 10 8 11 12
YELLOW 9 9 11 8 7
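However, I need to compute this for each cluster independently. That is, for each class I need to select its associated observations and then compute the frequencies. For example, for class = 1: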
# Expected results for the first cluster : class == 1
library(dplyr)  # provides filter()
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
y1 y2 y3 y4 y5
PURPLE 2 3 2 3 3
YELLOW 2 1 2 1 1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
y1 y2 y3 y4 y5
PURPLE 5 5 3 4 5
YELLOW 3 3 5 4 3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
y1 y2 y3 y4 y5
PURPLE 3 2 3 4 4
YELLOW 4 5 4 3 3
I am looking for an efficient way to get these results (perhaps using dplyr's select function). Thanks for your help!
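You can pass table an additional grouping vector, here clusterss$class. Each column is then cross-tabulated against the cluster, and the result stacks the clusters in pairs of rows (rows 1-2 are class 1, rows 3-4 are class 2, rows 5-6 are class 3):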
sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
# YELLOW SMALL STRETCH ADULT T
#[1,] 2 3 2 3 3
#[2,] 2 1 2 1 1
#[3,] 5 5 3 4 5
#[4,] 3 3 5 4 3
#[5,] 3 2 3 4 4
#[6,] 4 5 4 3 3
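If you would rather have one labelled table per cluster than a single stacked matrix, a base R sketch along these lines may help. It assumes clusterss$index holds row numbers of ballons; the names cluster_of_row and by_cluster are mine, not from the question, and the data is rebuilt inline only so the snippet runs on its own.

```r
# Same data as in the question, rebuilt from the integer codes of the dput() output.
ballons <- data.frame(
  YELLOW  = factor(c(2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1), labels = c("PURPLE", "YELLOW")),
  SMALL   = factor(c(2,2,2,2,1,1,1,1,1,2,2,2,2,2,1,1,1,1,1), labels = c("LARGE", "SMALL")),
  STRETCH = factor(c(2,2,1,1,2,2,2,1,1,2,2,2,1,1,2,2,2,1,1), labels = c("DIP", "STRETCH")),
  ADULT   = factor(c(1,2,1,2,1,1,2,1,2,1,1,2,1,2,1,1,2,1,2), labels = c("ADULT", "CHILD")),
  T       = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
              TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)
clusterss <- data.frame(index = 1:19,
                        class = c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))

# Assumption: `index` is a row number of `ballons`. Look the cluster up
# through it instead of relying on the two objects sharing the same order.
cluster_of_row <- clusterss$class[match(seq_len(nrow(ballons)), clusterss$index)]

# Split the rows by cluster, then tabulate every column within each piece.
by_cluster <- lapply(split(ballons, cluster_of_row),
                     function(d) sapply(d, table))

by_cluster[["1"]]  # frequency table for class == 1
```

by_cluster is a list named "1", "2" and "3", and each element should match the corresponding expected table in the question. Keep in mind that sapply() takes the row labels from the first column only, so PURPLE and YELLOW really stand for "first level" and "second level" of each column. Also, if some cluster contained only TRUE or only FALSE in T, the tables would differ in length and sapply() would return a list rather than a matrix; converting T to a factor beforehand avoids that.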