我有这样的数据:
df <- data.frame(
id = sample(1:5, 100, replace = TRUE),
dur = sample(c(NA, rnorm(10)), 100, replace = TRUE),
char = sample(LETTERS, 100, replace = TRUE)
)
由此,我可以计算变量char
:的计数和比例
library(dplyr)
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
char freq prop
1 C 6 8.571429
2 M 6 8.571429
3 X 5 7.142857
4 Y 5 7.142857
5 Z 5 7.142857
6 E 4 5.714286
7 I 4 5.714286
8 K 4 5.714286
9 J 3 4.285714
10 Q 3 4.285714
... clipped
现在,在df
中,char
值也具有dur
值。所以我想添加另一列,比如mean_dur
,其中dur
的平均值由df
中的char
分组。在上面的代码中添加group_by(char)
等内容是不起作用的,因为变量char
不再被识别。如何做到这一点?
编辑:它可以分步骤完成,比如:
# Step 1 -- make df with counts and proportions:
df1 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100)
# Step 2 -- make another df with mean dur values:
df2 <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur, na.rm = TRUE))
# Step 3 -- transfer mean dur values by matching `char`in `df1`and `df2`
df1$mean_dur <- df2$mean_dur[match(df1$char, df2$char)]
但有没有更干净、更整洁的dplyr
方式?
编辑2:
多亏了@Anousiravan R的解决方案,我从中选择了left_join
的想法,这似乎是一个干净整洁的解决方案(而且它不需要包janitor
(:
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq', sort = TRUE) %>%
mutate(prop = prop.table(freq) * 100) %>%
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")
我希望这就是您想要的:
library(dplyr)
library(janitor)
df %>%
filter(!is.na(dur) & !id == lag(id)) %>%
tabyl(char) %>%
rename(freq = percent) %>%
mutate(freq = freq * 100) %>%
select(-n) %>%
arrange(desc(freq)) %>%
left_join(df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(mean_dur = mean(dur)), by = "char")
char freq mean_dur
T 7.894737 -0.4861708
Z 7.894737 -0.2867046
A 6.578947 -0.5056797
B 5.263158 0.3513478
E 5.263158 0.5113139
K 5.263158 -1.4560764
L 5.263158 0.8235192
N 5.263158 0.9037481
X 5.263158 -1.4669529
C 3.947368 -0.4064762
I 3.947368 -0.7722133
P 3.947368 -0.1076928
U 3.947368 0.5573875
Y 3.947368 0.2404896
D 2.631579 0.5942473
F 2.631579 1.2381883
G 2.631579 -0.2155605
J 2.631579 1.0528329
M 2.631579 -1.5482806
O 2.631579 0.2813264
S 2.631579 1.2132490
V 2.631579 0.6157874
H 1.315789 -1.2664754
Q 1.315789 1.1027114
R 1.315789 0.1288634
W 1.315789 1.0528329
如果你准备放弃prop.table
,那么我认为这会给你带来你想要的。。。
df %>%
filter(!is.na(dur) & id != lag(id)) %>%
group_by(char) %>%
summarise(
n=n(),
prop = 100*n/nrow(.),
mean_dur=mean(dur, na.rm=TRUE),
.groups="drop"
)
# A tibble: 25 x 4
char n prop mean_dur
* <fct> <int> <dbl> <dbl>
1 A 6 8.82 0.158
2 B 5 7.35 -0.144
3 C 2 2.94 0.951
4 D 2 2.94 0.518
5 E 5 7.35 0.211
6 F 3 4.41 0.333
7 G 2 2.94 0.951
8 H 3 4.41 0.624
9 I 2 2.94 -0.422
10 J 2 2.94 -0.347
# … with 15 more rows
[我花了一段时间才注意到你在处理随机数据。set.seed()
会很有帮助!;=(]
根据下面的注释进行编辑
另一个选项:
mean_dur <- df %>% group_by(char) %>% summarise(mean_dur=mean(dur,na.rm=T))
tab <- df %>%
filter(!is.na(dur) & id != lag(id)) %>%
count(char, name = 'freq') %>%
mutate(prop = prop.table(freq) * 100)
tab <- merge.data.frame(tab,mean_dur)
tab <- tab[order(tab$freq,decreasing = T),]
char freq prop mean_dur
17 R 6 8.108108 -0.75610907
3 D 5 6.756757 -0.61657511
5 F 5 6.756757 -0.34153689
10 K 5 6.756757 -0.90688768
19 T 5 6.756757 0.33628707
6 G 4 5.405405 -0.93390134
9 J 4 5.405405 0.27471673
11 L 4 5.405405 0.87029782
13 N 4 5.405405 0.17163797
16 Q 4 5.405405 -0.67554378
22 X 4 5.405405 -0.42108346
7 H 3 4.054054 0.36290234
14 O 3 4.054054 -0.56712470
15 P 3 4.054054 0.08316665
2 C 2 2.702703 -1.15398142
4 E 2 2.702703 -0.31271923
12 M 2 2.702703 -0.96001502
18 S 2 2.702703 -0.88921047
20 U 2 2.702703 0.24299241
21 W 2 2.702703 -1.32772406
1 A 1 1.351351 0.24299241
8 I 1 1.351351 -1.07336407
23 Z 1 1.351351 -1.07336407