What is the best way to handle potentially missing columns when summarizing?

A financial statement is a good illustration of this problem. Here is a sample data frame:

df <- data.frame(
  date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 10),
  category = sample(c('a', 'b', 'c'), 10, replace = TRUE),
  direction = sample(c('credit', 'debit'), 10, replace = TRUE),
  value = sample(0:25, 10, replace = TRUE)
)

I want to produce a summary table with incoming, outgoing and total columns for each category.

library(dplyr)
library(tidyr)

df %>%
  pivot_wider(names_from = direction, values_from = value) %>%
  group_by(category) %>%
  summarize(incoming = sum(credit, na.rm = TRUE),
            outgoing = sum(debit, na.rm = TRUE)) %>%
  mutate(total = incoming - outgoing)

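In most cases this works perfectly, as it does with the sample data frame above.

In some cases, however, df$direction may contain only a single value, e.g. credit, which causes an error: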

Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found

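Assuming I have no control over the data frame, what is the best way to handle this?

I have been trying to use a conditional inside summarize to check whether the column exists, but have not managed to get it to work.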

...
summarize( outgoing = case_when(
"debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE), 
TRUE ~ 0 ) )
...

Have I made a syntax error, or am I heading in completely the wrong direction?

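The problem occurs only when just one of the values is present, i.e. 'credit' but no 'debit', or vice versa. In that case pivot_wider does not create the missing column. Instead of pivoting and then summarizing, do the calculation directly in summarise with ==. If 'debit' is absent, value[direction == "debit"] is an empty vector and sum returns 0 for it: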

library(dplyr)
df %>%  
slice(-c(9:10)) %>% # just removed the 'debit' rows completely
group_by(category) %>% 
summarise(total  = sum(value[direction == 'credit']) - 
sum(value[direction == "debit"]))

This gives the output:

# A tibble: 3 × 2
category total
<chr>    <int>
1 a           15
2 b           30
3 c           63
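
The question asked for incoming, outgoing and total columns per category, and the same logical-subsetting idea extends to all three. The following is a minimal sketch, assuming a small hand-made data frame that contains only credit rows; df_credit_only and summary_tbl are names made up for this illustration and do not appear in the original post.

```r
library(dplyr)

# Hypothetical data with no 'debit' rows at all
df_credit_only <- data.frame(
  category  = c("a", "b", "b", "c"),
  direction = c("credit", "credit", "credit", "credit"),
  value     = c(15L, 10L, 20L, 63L)
)

summary_tbl <- df_credit_only %>%
  group_by(category) %>%
  summarise(
    incoming = sum(value[direction == "credit"]),
    # value[direction == "debit"] is empty here; sum() of an empty vector is 0
    outgoing = sum(value[direction == "debit"])
  ) %>%
  mutate(total = incoming - outgoing)

summary_tbl
```

Because no column named debit is ever referenced, this form should not depend on which directions happen to be present in the data.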

That is not the case with pivot_wider:

df %>% 
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value) 
# A tibble: 8 × 3
date       category credit
<date>     <chr>     <int>
1 2020-07-25 c            19
2 2020-05-09 b            15
3 2020-08-27 a            15
4 2020-03-27 b            15
5 2020-04-06 c             6
6 2020-07-06 c            11
7 2020-09-22 c            25
8 2020-10-06 c             2

It creates only the 'credit' column, so referring to the 'debit' column, which was never created, throws an error:

df %>% 
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value)  %>%
group_by(category) %>% 
summarize(incoming = sum(credit, na.rm=TRUE), 
outgoing=sum(debit,na.rm=TRUE) )

Error: Problem with `summarise()` column `outgoing`.
`outgoing = sum(debit, na.rm = TRUE)`.
object 'debit' not found
The error occurred in group 1: category = "a".
Run `rlang::last_error()` to see where the error occurred.

In this case, we can use complete to add rows for 'debit' as well, with NA in the other columns, so that pivot_wider creates both columns:

library(tidyr)
df %>% 
slice(-c(9:10)) %>%
complete(category, direction = c("credit", "debit")) %>% 
pivot_wider(names_from = direction, values_from = value) %>% 
group_by(category) %>% 
summarize(incoming = sum(credit, na.rm=TRUE), 
outgoing=sum(debit,na.rm=TRUE) ) %>% 
mutate(total= incoming-outgoing)
# A tibble: 3 × 4
category incoming outgoing total
<chr>       <int>    <int> <int>
1 a              15        0    15
2 b              30        0    30
3 c              63        0    63
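
If the pivoting step has to stay, another option is to tell pivot_wider which columns to expect. This is only a sketch: it assumes tidyr 1.2.0 or later (where pivot_wider has a names_expand argument) and assumes that 'credit' and 'debit' are the only valid directions. The data frame and the names df_credit_only and wide_summary are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with no 'debit' rows at all
df_credit_only <- data.frame(
  date      = as.Date(c("2020-01-05", "2020-02-10", "2020-03-15")),
  category  = c("a", "b", "b"),
  direction = c("credit", "credit", "credit"),
  value     = c(15L, 10L, 20L)
)

wide_summary <- df_credit_only %>%
  # Declare both levels so the unseen 'debit' level is still known
  mutate(direction = factor(direction, levels = c("credit", "debit"))) %>%
  # names_expand = TRUE creates one column per factor level, filled with NA
  pivot_wider(names_from = direction, values_from = value,
              names_expand = TRUE) %>%
  group_by(category) %>%
  summarize(incoming = sum(credit, na.rm = TRUE),
            outgoing = sum(debit, na.rm = TRUE)) %>%
  mutate(total = incoming - outgoing)

wide_summary
```

Compared with complete, this adds no extra rows to the data; the missing level is handled entirely at the pivoting step.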
