我有一个超过10000行的大型数据集:df:
User duration
amy 582
amy 27
amy 592
amy 16
amy 250
tom 33
tom 10
tom 40
tom 100
所需输出:
User duration
amy 582
amy 592
amy 250
tom 33
tom 10
tom 40
从本质上讲,这将从每个唯一的用户平均值中删除任何2SD的异常值。该代码将获取每个唯一用户的平均值,确定其平均值和标准偏差,然后删除平均值大于2SD的值。
dput:
structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L,
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
这就是我尝试过的:
first define average and standard deviation
ave = ave(df$duration)
sd = sd(df$duration)
然后设置一些参数:
for i in df {
remove all if > 2*sd}
我不确定,希望得到一些建议。
我们可以使用dplyr
,当与between
一起使用时,它会非常简洁
library(dplyr)
df %>%
group_by(User) %>%
filter(between(duration, mean(duration) - sd(duration),
mean(duration) + sd(duration)))
您可以使用scale()
来查找z分数,并保持绝对值小于2:
library(dplyr)
df %>%
group_by(User) %>%
filter(abs(scale(duration)) < 2)
# A tibble: 9 x 2
# Groups: User [2]
User duration
<fct> <int>
1 amy 582
2 amy 27
3 amy 592
4 amy 16
5 amy 250
6 tom 33
7 tom 10
8 tom 40
9 tom 100
这里有一种data.table方法,对于许多行来说可能更快。
library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000,
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
User duration
1 amy 50000
2 amy 582
3 amy 27
4 amy 592
5 amy 16
6 amy 250
7 tom 33
8 tom 10
9 tom 40
10 tom 100
代码
setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
duration >= mean(duration) - (2 * sd(duration)),]
,by=User]
User duration
1: amy 582
2: amy 27
3: amy 592
4: amy 16
5: amy 250
6: tom 33
7: tom 10
8: tom 40
9: tom 100
我们可以尝试在dplyr
中使用mutate
和filter
函数
library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>%
filter(duration < ave_plus2sd)
这将为您提供以下输出,允许将每个条目与用户的平均+2*sd进行比较。
# Groups: User [2]
User duration ave_plus2sd
<fct> <int> <dbl>
1 amy 582 861.
2 amy 27 861.
3 amy 592 861.
4 amy 16 861.
5 amy 250 861.
6 tom 33 122.
7 tom 10 122.
8 tom 40 122.
9 tom 100 122.
我们可以进一步添加%>% select (User,duration)
来选择感兴趣的列User和duration。