计算字符串序列的平均值，然后删除R中任何大于平均值2SD的值

我有一个超过10000行的大型数据集：df:

User              duration
amy                582         
amy                27
amy                592
amy                16
amy                250
tom                33
tom                10
tom                40
tom                100

所需输出：

User               duration
amy                 582
amy                 592
amy                 250
tom                 33
tom                 10
tom                 40

从本质上讲，这将从每个唯一的用户平均值中删除任何2SD的异常值。该代码将获取每个唯一用户的平均值，确定其平均值和标准偏差，然后删除平均值大于2SD的值。

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

这就是我尝试过的：

first define average and standard deviation

ave = ave(df$duration)
sd =  sd(df$duration)

然后设置一些参数：

for i in df {
remove all if > 2*sd}

我不确定，希望得到一些建议。

我们可以使用dplyr，当与between一起使用时，它会非常简洁

library(dplyr)
df %>% 
group_by(User) %>%
filter(between(duration, mean(duration) -  sd(duration), 
mean(duration) +   sd(duration)))

您可以使用scale()来查找z分数，并保持绝对值小于2:

library(dplyr)
df %>%
group_by(User) %>%
filter(abs(scale(duration)) < 2)
# A tibble: 9 x 2
# Groups:   User [2]
User  duration
<fct>    <int>
1 amy        582
2 amy         27
3 amy        592
4 amy         16
5 amy        250
6 tom         33
7 tom         10
8 tom         40
9 tom        100

这里有一种data.table方法，对于许多行来说可能更快。

library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000, 
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
User duration
1   amy    50000
2   amy      582
3   amy       27
4   amy      592
5   amy       16
6   amy      250
7   tom       33
8   tom       10
9   tom       40
10  tom      100

代码

setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
duration >= mean(duration) - (2 * sd(duration)),]
,by=User]
User duration
1:  amy      582
2:  amy       27
3:  amy      592
4:  amy       16
5:  amy      250
6:  tom       33
7:  tom       10
8:  tom       40
9:  tom      100

我们可以尝试在dplyr中使用mutate和filter函数

library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>% 
filter(duration < ave_plus2sd)

这将为您提供以下输出，允许将每个条目与用户的平均+2*sd进行比较。

# Groups:   User [2]
User  duration ave_plus2sd
<fct>    <int>       <dbl>
1 amy        582        861.
2 amy         27        861.
3 amy        592        861.
4 amy         16        861.
5 amy        250        861.
6 tom         33        122.
7 tom         10        122.
8 tom         40        122.
9 tom        100        122.

我们可以进一步添加%>% select (User,duration)来选择感兴趣的列User和duration。

相关内容

最新更新

热门标签：