R: How to compute the frequencies of categorical variables by condition



Good afternoon,

Suppose we have the following dataset from UCI:

ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE", 
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L, 
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA, 
-19L))
# output :
   YELLOW SMALL STRETCH ADULT     T
1  YELLOW SMALL STRETCH ADULT  TRUE
2  YELLOW SMALL STRETCH CHILD FALSE
3  YELLOW SMALL     DIP ADULT FALSE
4  YELLOW SMALL     DIP CHILD FALSE
5  YELLOW LARGE STRETCH ADULT  TRUE
6  YELLOW LARGE STRETCH ADULT  TRUE
7  YELLOW LARGE STRETCH CHILD FALSE
8  YELLOW LARGE     DIP ADULT FALSE
9  YELLOW LARGE     DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT  TRUE
11 PURPLE SMALL STRETCH ADULT  TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL     DIP ADULT FALSE
14 PURPLE SMALL     DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT  TRUE
16 PURPLE LARGE STRETCH ADULT  TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE     DIP ADULT FALSE
19 PURPLE LARGE     DIP CHILD FALSE

Suppose I have also applied a clustering algorithm and obtained the following result:

clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
   index class
1      1     1
2      2     2
3      3     3
4      4     3
5      5     3
6      6     2
7      7     3
8      8     1
9      9     2
10    10     3
11    11     3
12    12     2
13    13     2
14    14     3
15    15     2
16    16     2
17    17     1
18    18     1
19    19     2

Here, the index variable is the row number in ballons, and class is the cluster that row was assigned to.
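A minimal sketch of that mapping, using only the clusterss data frame defined above. The names rows_by_class and class_by_row are hypothetical and are introduced here purely for illustration.

```r
clusterss <- data.frame(index = 1:19,
                        class = c(1, 2, 3, 3, 3, 2, 3, 1, 2, 3, 3, 2, 2, 3, 2, 2, 1, 1, 2))

# Row numbers of ballons that belong to each cluster
rows_by_class <- split(clusterss$index, clusterss$class)
rows_by_class[["1"]]
# [1]  1  8 17 18

# Cluster label of every ballons row, in row order
# (ordering by index guards against clusterss being shuffled)
class_by_row <- clusterss$class[order(clusterss$index)]
```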

I know we can compute the frequencies of all the categorical variables with

> sapply(ballons,table)
       YELLOW SMALL STRETCH ADULT  T
PURPLE     10    10       8    11 12
YELLOW      9     9      11     8  7

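However, I need to compute this independently for each cluster. That is, for each class I need to select the associated observations and then compute the frequencies. For example, for class = 1: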

# Expected results for the first cluster : class == 1
library(dplyr)  # provides filter()
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
       YELLOW SMALL STRETCH ADULT T
PURPLE      2     3       2     3 3
YELLOW      2     1       2     1 1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
       YELLOW SMALL STRETCH ADULT T
PURPLE      5     5       3     4 5
YELLOW      3     3       5     4 3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
       YELLOW SMALL STRETCH ADULT T
PURPLE      3     2       3     4 4
YELLOW      4     5       4     3 3

I am looking for an efficient way to get these results (possibly using dplyr's select function). Thanks for your help!

You can pass table an additional grouping vector, here clusterss$class:

sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
#     YELLOW SMALL STRETCH ADULT T
#[1,]      2     3       2     3 3
#[2,]      2     1       2     1 1
#[3,]      5     5       3     4 5
#[4,]      3     3       5     4 3
#[5,]      3     2       3     4 4
#[6,]      4     5       4     3 3
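The rows of that matrix are unlabeled: within each column they run through the variable's two levels for class 1, then class 2, then class 3. If labeled per-cluster tables like the ones in the question are preferred, one possible base R sketch is to split ballons by clusterss$class and tabulate each piece. The data are rebuilt compactly here only so the snippet runs on its own, and by_cluster is a hypothetical name. The sketch assumes clusterss is sorted by index. It also assumes every cluster contains both TRUE and FALSE in the logical column T, because otherwise sapply returns a list rather than a matrix.

```r
# Rebuild the question's data compactly so the snippet is self-contained
ballons <- data.frame(
  YELLOW  = factor(rep(c("YELLOW", "PURPLE"), times = c(9, 10))),
  SMALL   = factor(rep(c("SMALL", "LARGE", "SMALL", "LARGE"), times = c(4, 5, 5, 5))),
  STRETCH = factor(rep(rep(c("STRETCH", "DIP"), 4), times = c(2, 2, 3, 2, 3, 2, 3, 2))),
  ADULT   = factor(c("ADULT", "CHILD")[c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1,
                                         1, 2, 1, 2, 1, 1, 2, 1, 2)]),
  T       = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
              TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)
clusterss <- data.frame(index = 1:19,
                        class = c(1, 2, 3, 3, 3, 2, 3, 1, 2, 3, 3, 2, 2, 3, 2, 2, 1, 1, 2))

# One sub-data-frame per cluster, then one frequency table per column
by_cluster <- lapply(split(ballons, clusterss$class),
                     function(d) sapply(d, table))

by_cluster[["1"]]
#        YELLOW SMALL STRETCH ADULT T
# PURPLE      2     3       2     3 3
# YELLOW      2     1       2     1 1
```

As in the question's own output, the row labels come from the first column only. In the other columns, the first row counts the alphabetically first level (LARGE, DIP, ADULT, FALSE) and the second row counts the other level.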
