计算字符串序列的平均值,然后删除R中任何大于平均值2SD的值



我有一个超过10000行的大型数据集:df:

User              duration
amy                582         
amy                27
amy                592
amy                16
amy                250
tom                33
tom                10
tom                40
tom                100

所需输出:

User               duration
amy                 582
amy                 592
amy                 250
tom                 33
tom                 10
tom                 40

从本质上讲,这将从每个唯一的用户平均值中删除任何2SD的异常值。该代码将获取每个唯一用户的平均值,确定其平均值和标准偏差,然后删除平均值大于2SD的值。

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

这就是我尝试过的:

first define average and standard deviation

ave = ave(df$duration)
sd =  sd(df$duration)

然后设置一些参数:

for i in df {
remove all if > 2*sd}

我不确定,希望得到一些建议。

我们可以使用dplyr,当与between一起使用时,它会非常简洁

library(dplyr)
df %>% 
group_by(User) %>%
filter(between(duration, mean(duration) -  sd(duration), 
mean(duration) +   sd(duration)))

您可以使用scale()来查找z分数,并保持绝对值小于2:

library(dplyr)
df %>%
group_by(User) %>%
filter(abs(scale(duration)) < 2)
# A tibble: 9 x 2
# Groups:   User [2]
User  duration
<fct>    <int>
1 amy        582
2 amy         27
3 amy        592
4 amy         16
5 amy        250
6 tom         33
7 tom         10
8 tom         40
9 tom        100

这里有一种data.table方法,对于许多行来说可能更快。

library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000, 
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
User duration
1   amy    50000
2   amy      582
3   amy       27
4   amy      592
5   amy       16
6   amy      250
7   tom       33
8   tom       10
9   tom       40
10  tom      100

代码

setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
duration >= mean(duration) - (2 * sd(duration)),]
,by=User]
User duration
1:  amy      582
2:  amy       27
3:  amy      592
4:  amy       16
5:  amy      250
6:  tom       33
7:  tom       10
8:  tom       40
9:  tom      100

我们可以尝试在dplyr中使用mutatefilter函数

library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>% 
filter(duration < ave_plus2sd) 

这将为您提供以下输出,允许将每个条目与用户的平均+2*sd进行比较。

# Groups:   User [2]
User  duration ave_plus2sd
<fct>    <int>       <dbl>
1 amy        582        861.
2 amy         27        861.
3 amy        592        861.
4 amy         16        861.
5 amy        250        861.
6 tom         33        122.
7 tom         10        122.
8 tom         40        122.
9 tom        100        122.

我们可以进一步添加%>% select (User,duration)来选择感兴趣的列User和duration。

相关内容

最新更新