r - Why does dplyr remove values when the condition is not met?



I am using dplyr to replace value with NA when a condition is met, but it also puts NA in places where it should not.

dput:

df <- structure(list(id = c("USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275", "USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275"), element = c("TMAX", "TMIN", "TMAX", 
"TMIN", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"), year = c(1937, 
1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937), month = c(5, 
5, 5, 5, 5, 5, 5, 5, 5, 5), day = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 
5), date = structure(c(-11933, -11933, -11932, -11932, -11931, 
-11931, -11930, -11930, -11929, -11929), class = "Date"), value = c(0, 
53.96, 68, 44.96, 62.06, 53.96, 73.04, 53.96, 69.08, 50)), .Names = c("id", 
"element", "year", "month", "day", "date", "value"), row.names = c(NA, 
10L), class = "data.frame")

data.frame (note: the condition is met only in rows 1 and 2)

            id element year month day       date value
1  USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2  USC00231275    TMIN 1937     5   1 1937-05-01 53.96
3  USC00231275    TMAX 1937     5   2 1937-05-02 68.00
4  USC00231275    TMIN 1937     5   2 1937-05-02 44.96
5  USC00231275    TMAX 1937     5   3 1937-05-03 62.06
6  USC00231275    TMIN 1937     5   3 1937-05-03 53.96
7  USC00231275    TMAX 1937     5   4 1937-05-04 73.04
8  USC00231275    TMIN 1937     5   4 1937-05-04 53.96
9  USC00231275    TMAX 1937     5   5 1937-05-05 69.08
10 USC00231275    TMIN 1937     5   5 1937-05-05 50.00

dplyr

df %>%
  group_by(date) %>%
  mutate(
    value = if(value[element == 'TMIN'] >= value[element == 'TMAX'])
      as.numeric(NA) else value
  )
            id element  year month   day       date value
         (chr)   (chr) (dbl) (dbl) (dbl)     (date) (dbl)
1  USC00231275    TMAX  1937     5     1 1937-05-01    NA
2  USC00231275    TMIN  1937     5     1 1937-05-01    NA
3  USC00231275    TMAX  1937     5     2 1937-05-02 68.00
4  USC00231275    TMIN  1937     5     2 1937-05-02 44.96
5  USC00231275    TMAX  1937     5     3 1937-05-03    NA
6  USC00231275    TMIN  1937     5     3 1937-05-03    NA
7  USC00231275    TMAX  1937     5     4 1937-05-04 73.04
8  USC00231275    TMIN  1937     5     4 1937-05-04 53.96
9  USC00231275    TMAX  1937     5     5 1937-05-05 69.08
10 USC00231275    TMIN  1937     5     5 1937-05-05 50.00

Note that only rows 1 and 2 should have changed, but dplyr also changed rows 5 and 6, even though the condition is not met there.

The code below should do what you are trying to do:

df %>%
  group_by(date) %>%
  mutate(
    new_value = ifelse(
      (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
      NA, value
    )
  ) %>%
  ungroup()

As for whether this is a bug, I don't think it is. Looking only at the one date where TMIN >= TMAX, you get the following:

df %>%
  filter(date == '1937-05-01') %>%
  mutate(res = (value[element == 'TMIN'] >= value[element == 'TMAX'])) %>%
  mutate(new_value = ifelse( (res & element=='TMIN'), NA, value))
           id element year month day       date value  res new_value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00 TRUE         0
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96 TRUE        NA

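The expression value[element == 'TMIN'] >= value[element == 'TMAX'] is evaluated once for the group, so it is TRUE for every row here, as seen in the res column. The code below breaks this down a little, which I hope makes it clearer.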

### Just looking at one date
> df2 <- df %>% filter(date == '1937-05-01')
> df2
           id element year month day       date value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96
### This comparison will be recycled for every element in the group,
### so it will always be TRUE or always FALSE.
> c(df2$value[df2$element == 'TMIN'], df2$value[df2$element == 'TMAX'])
[1] 53.96  0.00

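Since there is only one comparison for the whole group, every row in that group sees the same result: either all TRUE or all FALSE.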
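
The recycling described above can be sketched in base R, without dplyr. This is only an illustration using the two values from the 1937-05-01 group; the names group_flag, row_flag and new_value are made up for this sketch and do not appear in the original code.

```r
# One date's worth of readings, mirroring the 1937-05-01 group.
element <- c("TMAX", "TMIN")
value   <- c(0, 53.96)

# The comparison produces a single logical for the whole group.
group_flag <- value[element == "TMIN"] >= value[element == "TMAX"]

# Combined with a per-row test, that single flag is recycled to every row,
# so only the row that is also 'TMIN' ends up TRUE.
row_flag <- group_flag & element == "TMIN"

# Replace the flagged row with NA and keep the other value.
new_value <- ifelse(row_flag, NA, value)
```

Here group_flag has length 1, while row_flag has one entry per row. That per-row vector is what lets ifelse() treat the rows of a group differently.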

The code that gives the correct result shows how to make the comparison row by row.

One possible final solution is:

df %>%
   group_by(date) %>%
   mutate(value = ifelse( ( (value[element == 'TMIN'] >= value[element == 'TMAX']) & element=='TMIN'), NA, value)) %>%
   ungroup
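
As a rough check, the solution can be tried on a small subset of the question's data. The sketch below assumes every date has exactly one TMIN and one TMAX row. The data frame toy and the columns bad_day, tmin_only and both_rows are hypothetical names used only here. Note that the solution above blanks only the TMIN reading, while the question suggests both rows of the offending date should become NA, so the both_rows column shows that variant as one possible reading of the intent.

```r
library(dplyr)

# Two dates taken from the question's data: TMIN >= TMAX only on 1937-05-01.
toy <- data.frame(
  date    = as.Date(c("1937-05-01", "1937-05-01", "1937-05-03", "1937-05-03")),
  element = c("TMAX", "TMIN", "TMAX", "TMIN"),
  value   = c(0, 53.96, 62.06, 53.96)
)

result <- toy %>%
  group_by(date) %>%
  mutate(
    # One TRUE/FALSE per date, recycled to every row of that date.
    bad_day   = value[element == "TMIN"] >= value[element == "TMAX"],
    # As in the answer: only the TMIN reading is set to NA.
    tmin_only = ifelse(bad_day & element == "TMIN", NA, value),
    # Variant (assumed intent): set both readings of the date to NA.
    both_rows = ifelse(bad_day, NA, value)
  ) %>%
  ungroup()
```

With this input, tmin_only should be NA only for the TMIN row of 1937-05-01, and both_rows should be NA for both rows of that date, leaving 1937-05-03 untouched in either case.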
