r-将分组方式添加到具有分组频率和比例的数据帧中

  • 本文关键字:数据帧 频率 方式 添加 r dplyr
  • 更新时间 :
  • 英文 :


我有这样的数据:

df <- data.frame(
id = sample(1:5, 100, replace = TRUE),
dur = sample(c(NA, rnorm(10)), 100, replace = TRUE),
char = sample(LETTERS, 100, replace = TRUE)
)

由此,我可以计算变量char:的计数和比例

library(dplyr)
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
char freq     prop
1     C    6 8.571429
2     M    6 8.571429
3     X    5 7.142857
4     Y    5 7.142857
5     Z    5 7.142857
6     E    4 5.714286
7     I    4 5.714286
8     K    4 5.714286
9     J    3 4.285714
10    Q    3 4.285714
... clipped

现在,在df中,char值也具有dur值。所以我想添加另一列,比如mean_dur,其中dur的平均值由df中的char分组。在上面的代码中添加group_by(char)等内容是不起作用的,因为变量char不再被识别。如何做到这一点?

编辑:它可以分步骤完成,比如:

# Step 1 -- make df with counts and proportions:
df1 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
# Step 2 -- make another df with mean dur values:
df2 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur, na.rm = TRUE))
# Step 3 -- transfer mean dur values by matching `char`in `df1`and `df2`
df1$mean_dur <- df2$mean_dur[match(df1$char, df2$char)]

但有没有更干净、更整洁的dplyr方式?

编辑2

多亏了@Anousiravan R的解决方案,我从中选择了left_join的想法,这似乎是一个干净整洁的解决方案(而且它不需要包janitor(:

df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100) %>%
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")

我希望这就是您想要的:

library(dplyr)
library(janitor)
df %>%
filter(!is.na(dur) & !id == lag(id)) %>%
tabyl(char) %>%
rename(freq = percent) %>%
mutate(freq = freq * 100) %>%
select(-n) %>%
arrange(desc(freq)) %>% 
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")

char     freq   mean_dur
T 7.894737 -0.4861708
Z 7.894737 -0.2867046
A 6.578947 -0.5056797
B 5.263158  0.3513478
E 5.263158  0.5113139
K 5.263158 -1.4560764
L 5.263158  0.8235192
N 5.263158  0.9037481
X 5.263158 -1.4669529
C 3.947368 -0.4064762
I 3.947368 -0.7722133
P 3.947368 -0.1076928
U 3.947368  0.5573875
Y 3.947368  0.2404896
D 2.631579  0.5942473
F 2.631579  1.2381883
G 2.631579 -0.2155605
J 2.631579  1.0528329
M 2.631579 -1.5482806
O 2.631579  0.2813264
S 2.631579  1.2132490
V 2.631579  0.6157874
H 1.315789 -1.2664754
Q 1.315789  1.1027114
R 1.315789  0.1288634
W 1.315789  1.0528329

如果你准备放弃prop.table,那么我认为这会给你带来你想要的。。。

df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>% 
summarise(
n=n(),
prop = 100*n/nrow(.),
mean_dur=mean(dur, na.rm=TRUE),
.groups="drop"
)
# A tibble: 25 x 4
char      n  prop mean_dur
* <fct> <int> <dbl>    <dbl>
1 A         6  8.82    0.158
2 B         5  7.35   -0.144
3 C         2  2.94    0.951
4 D         2  2.94    0.518
5 E         5  7.35    0.211
6 F         3  4.41    0.333
7 G         2  2.94    0.951
8 H         3  4.41    0.624
9 I         2  2.94   -0.422
10 J         2  2.94   -0.347
# … with 15 more rows

[我花了一段时间才注意到你在处理随机数据。set.seed()会很有帮助!;=(]

根据下面的注释进行编辑

另一个选项:

mean_dur <- df %>% group_by(char) %>% summarise(mean_dur=mean(dur,na.rm=T))
tab <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq') %>%
mutate(prop = prop.table(freq) * 100)
tab <- merge.data.frame(tab,mean_dur)
tab <- tab[order(tab$freq,decreasing = T),]
char freq     prop    mean_dur
17    R    6 8.108108 -0.75610907
3     D    5 6.756757 -0.61657511
5     F    5 6.756757 -0.34153689
10    K    5 6.756757 -0.90688768
19    T    5 6.756757  0.33628707
6     G    4 5.405405 -0.93390134
9     J    4 5.405405  0.27471673
11    L    4 5.405405  0.87029782
13    N    4 5.405405  0.17163797
16    Q    4 5.405405 -0.67554378
22    X    4 5.405405 -0.42108346
7     H    3 4.054054  0.36290234
14    O    3 4.054054 -0.56712470
15    P    3 4.054054  0.08316665
2     C    2 2.702703 -1.15398142
4     E    2 2.702703 -0.31271923
12    M    2 2.702703 -0.96001502
18    S    2 2.702703 -0.88921047
20    U    2 2.702703  0.24299241
21    W    2 2.702703 -1.32772406
1     A    1 1.351351  0.24299241
8     I    1 1.351351 -1.07336407
23    Z    1 1.351351 -1.07336407

最新更新