I have a data frame that looks like this:
y<-c("A1","B1", "C2", "A1", "B1","C1", "A1","B2", "C3", "A1", "B1", "C4", "A1", "B1","C4", "A1","B2", "C4", "A1","B1", "C4", "A1", "B1", "C4")
test<- data.frame(matrix(y, nrow = 3, ncol = 8))
colnames(test) <- c("Learn_1", "Car_1", "Car_2", "Fan_1", "Fan_2", "Fan_3","Kart_1", "God_1")
test
  Learn_1 Car_1 Car_2 Fan_1 Fan_2 Fan_3 Kart_1 God_1
1      A1    A1    A1    A1    A1    A1     A1    A1
2      B1    B1    B2    B1    B1    B2     B1    B1
3      C2    C1    C3    C4    C4    C4     C4    C4
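My actual data has 13 columns of varying length and thousands of rows, and the values are mixed. I want to determine how often each value in God_1 occurs across all the other columns, but columns that share the same word come from the same study (i.e. the Fan and Car columns), so a value that appears more than once within such a group should be counted only once. Then I want to plot the percentage of values that show up 5, 4, 3, 2, and 1 times against the total of available values in God_1 (100%). I am picturing a single bar that represents all the values, with different shades marking the percentage at each frequency (1, 2, 3, 4, 5). The frequencies in my plot should range from a minimum of 1 to a maximum of 5 (there are 5 unique column words).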
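My problem is that I don't know how to get started, even though I have been thinking about this for the past few days. Does anyone have an idea?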
For the example above, these are the frequencies I want:
A1 = 5
B1 = 5
C4 = 3
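Here is the str of my example. My real data looks like this, but with 2366 obs. of 13 variables, all factors with varying numbers of levels (ranging from 200 to 3000).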
str(test)
'data.frame': 3 obs. of 8 variables:
$ Learn_1: Factor w/ 3 levels "A1","B1","C2": 1 2 3
$ Car_1 : Factor w/ 3 levels "A1","B1","C1": 1 2 3
$ Car_2 : Factor w/ 3 levels "A1","B2","C3": 1 2 3
$ Fan_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ Fan_2 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ Fan_3 : Factor w/ 3 levels "A1","B2","C4": 1 2 3
$ Kart_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
$ God_1 : Factor w/ 3 levels "A1","B1","C4": 1 2 3
We can use dplyr and tidyr.
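First, the data is converted to long format with gather. Then we separate the numeric part from the column labels, remove duplicates with distinct (so a value counts once per study), count all occurrences, and use left_join to keep only the values that appear in the God_1 column.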
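library(dplyr)
library(tidyr)
# dat is the data frame from the question (called test there)
dat <- test
dat %>% gather(key, val) %>%               # long format: one row per (column, value)
  separate(key, c("id", "num")) %>%        # "Fan_1" -> id = "Fan", num = "1"
  distinct(id, val) %>%                    # keep each value once per study
  count(val) %>%                           # number of studies containing each value
  rename(out = n) %>%                      # name the count column as in the output below
  left_join(dat["God_1"], ., by = c(God_1 = "val"))  # keep only the God_1 values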
Source: local data frame [3 x 2]
  God_1   out
 (fctr) (dbl)
1    A1     5
2    B1     5
3    C4     3
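To cross-check the counts without any packages, and to get at the plot the question asks for, here is a rough base R sketch. It assumes that the study name is everything before the trailing "_<number>" in the column name, and that God_1 counts as one of the five studies (which is what the expected result of 5 for A1 implies). The names study, values_by_study, out, and share are mine, not from the question, and the barplot styling is only one way to draw the shaded bar.

```r
# Rebuild the example data from the question (as character, not factor)
y <- c("A1", "B1", "C2", "A1", "B1", "C1", "A1", "B2", "C3", "A1", "B1", "C4",
       "A1", "B1", "C4", "A1", "B2", "C4", "A1", "B1", "C4", "A1", "B1", "C4")
test <- data.frame(matrix(y, nrow = 3, ncol = 8), stringsAsFactors = FALSE)
colnames(test) <- c("Learn_1", "Car_1", "Car_2", "Fan_1", "Fan_2", "Fan_3",
                    "Kart_1", "God_1")

# Assumption: study name = column name without the trailing "_<number>"
study <- sub("_[0-9]+$", "", colnames(test))

# For each study, the distinct values seen in any of its columns
values_by_study <- lapply(
  split(seq_along(study), study),
  function(idx) unique(unlist(lapply(test[idx], as.character), use.names = FALSE))
)

# For each God_1 value, count the studies whose value set contains it
god <- as.character(test$God_1)
out <- vapply(
  god,
  function(v) sum(vapply(values_by_study, function(s) v %in% s, logical(1))),
  integer(1)
)
print(out)

# Percentage of God_1 values at each frequency level 1..(number of studies)
n_studies <- length(values_by_study)
share <- 100 * prop.table(table(factor(out, levels = seq_len(n_studies))))

# One stacked bar: each shaded segment is the share of one frequency level
barplot(as.matrix(share), col = gray.colors(n_studies), legend.text = TRUE,
        ylab = "% of God_1 values",
        main = "Number of studies in which each God_1 value occurs")
```

On the real data the same code should work unchanged as long as the column names follow the word_number pattern. If they do not, the sub() pattern is the part to adapt.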