Combine rows with certain values using aggregate, then return the aggregated subset to the data frame

Newbie here again, determined to ask a better reproducible question than my last one. My data frame:

> str(Denton)
'data.frame':   1666 obs. of  8 variables:
$ MIL.ID     : Factor w/ 18840 levels "","0000151472",..: 7393 3955 3955 3955 3871 3871 8627 8627 1609 11652 ...
$ Center     : int  8130 8130 8130 8130 8130 8130 8130 8130 8130 8130 ...
$ Gift.Date  : Factor w/ 339 levels "","01/01/2015",..: 3 6 6 6 7 7 7 7 8 8 ...
$ Gift.Amount: num  25 50 50 50 25 25 50 50 2500 20 ...
$ Solic.     : Factor w/ 31 levels "","aa","ac","an",..: 24 20 20 20 20 20 20 20 11 11 ...
$ Tender     : Factor w/ 10 levels "","c","ca","cc",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Account    : Factor w/ 16 levels "","29101-0000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Restriction: Factor w/ 258 levels "","AAU","ACA",..: 1 43 43 43 43 43 43 43 43 43 ...
> head(Denton)
MIL.ID     Center   Gift.Date         Gift.Amount Solic. Tender    Account Restriction
0000741377   8130 01/02/2015           25          ps     ca       29101-0000            
0000551071   8130 01/05/2015           50          mem    ca       29101-0000 BWC
0000551071   8130 01/05/2015           50          mem    ca       29101-0000 BWC
0000551071   8130 01/05/2015           50          mem    ca       29101-0000 BWC
0000544358   8130 01/06/2015           25          mem    ca       29101-0000 BWC
0000544358   8130 01/06/2015           25          mem    ca       29101-0000 BWC

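My end goal is simply to return summary data for this data frame, with one caveat: there is a Tender type "pd", payroll deduction, which occurs 26 times per year. Technically, each payroll deduction is part of a single gift, i.e. not 26 gifts but one. What I'm trying to do is combine the Gift.Amount values that share Tender "pd" and a MIL.ID (the donor ID), so that each person's multiple payroll deductions are merged into one gift. This part wasn't too hard, since I found some help in other examples on Stack Overflow: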

> df <- aggregate(Gift.Amount~MIL.ID,subset(Denton,Tender=="pd"),sum)
> head(df)
   MIL.ID     Gift.Amount
1 0000308080         324
2 0000308492          24
3 0000756682           4
4 0000757228          24
5 0000776957         850
6 0000777108         213

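This data frame contains the MIL.IDs associated with payroll deductions and sums the pd entries under each MIL.ID. Now comes the part where my puny brain gives out. Recall that earlier I wanted to simply run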

summary(Denton)

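and grab the mean and median once the pd entries under Tender had been summed by their associated MIL.ID. The problem is that the aggregated payroll-deduction data now exists only as a separate data frame. Somehow I need to: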

1) eliminate the old pd rows under Tender, 2) merge the Denton and df data frames, 3) summarize the data

Here's what I was able to figure out in base R:

> Denton <- Denton[Denton$Tender != "pd", ]

Now the original pd rows under Tender are gone. However, I can't bind Denton and df back together, because:

>str(df)
'data.frame':    77 obs. of  2 variables:
$ MIL.ID     : Factor w/ 18840 levels "","0000151472",..: 1613 1617 7967 7991 8627 8637 8797 8899 9807 11371 ...
$ Gift.Amount: num  324 24 4 24 850 213 360 4 11 24 ...

Both data frames are rectangular but of different sizes, so R can't combine them without throwing

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 502, 77.

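Is there a way to solve this in base R, or do I need to download the reshape package and learn how to melt? Do I even need to make things this complicated with the aggregate function?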

Edit in response to comments:

Current head() of Denton:

 > head(Denton)
MIL.ID     Center   Gift.Date         Gift.Amount Solic. Tender    Account Restriction
0000741377   8130 01/02/2015           25          ps     ca       29101-0000            
0000551071   8130 01/05/2015           50          mem    pd       29101-0000 BWC
0000551071   8130 01/05/2015           50          mem    pd       29101-0000 BWC
0000551071   8130 01/05/2015           50          mem    pd       29101-0000 BWC
0000544358   8130 01/06/2015           25          mem    pd       29101-0000 BWC
0000544358   8130 01/06/2015           25          mem    pd       29101-0000 BWC

Desired output once I've accomplished what I'm trying to do:

> head(Denton)
MIL.ID     Center   Gift.Date         Gift.Amount Solic. Tender    Account Restriction
0000741377   8130 01/02/2015           25          ps     ca       29101-0000            
0000551071   8130 01/05/2015          150          mem    pd       29101-0000 BWC
0000544358   8130 01/06/2015           50          mem    pd       29101-0000 BWC
0000556000   8130 01/05/2015           50          mem    ca       29101-0000 BWC
0000556005   8130 01/05/2015           50          mem    ca       29101-0000 BWC
0000556100   8130 01/05/2015           50          mem    ca       29101-0000 BWC

Then I would run

>summary(Denton)

to get my means and medians, since the pd Tenders for each MIL.ID would have been combined.

How about this, using dplyr functions:

> Denton %>%
     group_by(MIL.ID) %>% #sorts by MIL.ID
     select(MIL.ID, Gift.Amount, Tender) %>% #selects these three for agg
     filter(sum(Tender) <= pd) %>% #I think this should sum where tender= pd?
     distinct #get distinct rows?

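Below is a solution using the dplyr package. It isn't base R, but it greatly simplifies things, so it's worth adding to your R-senal of tools. (Sorry, couldn't help myself...)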

library(dplyr)
Denton <- data.frame("MIL.ID" = c(1,2,2,3,3,4),
                    "Tender" = c("ca", "pd", "pd", "pd", "pd", "ab"),
                    "Gift.Amount" = c(1,2,3,4,5,6),
                    "Solic" = c("ps", "mem", "mem", "mem", "mem", "ps")
                    )

This gives

  MIL.ID Tender Gift.Amount Solic
1      1     ca           1    ps
2      2     pd           2   mem
3      2     pd           3   mem
4      3     pd           4   mem
5      3     pd           5   mem
6      4     ab           6    ps

Now, use dplyr's functions to do what you want:

Denton %>% group_by(MIL.ID) %>%  # This groups by MIL.ID    
        mutate( Gift.Amount = sum(Gift.Amount)) %>%   # This gets the sum of each Gift.Amount
        distinct # This gets the distinct rows

Output:

Source: local data frame [4 x 4]
Groups: MIL.ID [4]
  MIL.ID Tender Gift.Amount  Solic
   (dbl) (fctr)       (dbl) (fctr)
1      1     ca           1     ps
2      2     pd           5    mem
3      3     pd           9    mem
4      4     ab           6     ps

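Notes:

This assumes that, for a given MIL.ID, all pd rows are identical apart from Gift.Amount, which appears to be the case from the example above. (If not, please update your question with the logic for deciding which row to keep, and I'll update my answer to use that logic.)

I also took the sum over all Tenders, not just the pd ones, because the sum of a single item is just that item's value. Doing so means I don't need to split the data and then bind two different dfs back together.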
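As a quick check of that reasoning, here is a minimal base R sketch on the same toy data. It uses `ave()` as a stand-in for `group_by()` plus `mutate(sum(...))`; that substitution is my assumption for illustration, not part of the dplyr answer itself.

```r
# Toy data from the answer above (illustrative values only)
Denton <- data.frame(
  MIL.ID      = c(1, 2, 2, 3, 3, 4),
  Tender      = c("ca", "pd", "pd", "pd", "pd", "ab"),
  Gift.Amount = c(1, 2, 3, 4, 5, 6)
)

# Per-donor total over every row, repeated on each of that donor's rows
total_all <- ave(Denton$Gift.Amount, Denton$MIL.ID, FUN = sum)

# Flag donors that have exactly one row
single_row <- ave(Denton$Gift.Amount, Denton$MIL.ID, FUN = length) == 1

# For single-row donors the grouped sum equals the original amount
unchanged <- all(total_all[single_row] == Denton$Gift.Amount[single_row])
print(unchanged)
```

Note that the shortcut relies on the toy data's shape: if a donor had both pd and non-pd rows, or several separate non-pd gifts, summing over the whole MIL.ID group would merge those as well.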

Edit

Another option is to split the Denton df into two parts:

df_notpd <- Denton %>% filter(Tender != "pd")
df_pd <- Denton %>% filter(Tender == "pd")

# Now do the necessary logic on *only* the pd portion.
df_pd <- df_pd %>% group_by(MIL.ID) %>%  # This groups by MIL.ID
        mutate( Gift.Amount = sum(Gift.Amount)) %>%   # This gets the sum of each Gift.Amount
        distinct # This gets the distinct rows
# Then rbind back with df_notpd
df <- rbind(df_notpd, df_pd)
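Since the question asked whether this can be done in base R, the same split, aggregate, and rbind idea might look roughly like the sketch below. The names `pd_totals`, `pd_first`, and `combined` are hypothetical, and keeping the first pd row per donor is an assumption that only holds if the pd rows for a donor differ solely in Gift.Amount.

```r
# Toy data mirroring the answer's example (illustrative values only)
Denton <- data.frame(
  MIL.ID      = c(1, 2, 2, 3, 3, 4),
  Tender      = c("ca", "pd", "pd", "pd", "pd", "ab"),
  Gift.Amount = c(1, 2, 3, 4, 5, 6),
  Solic       = c("ps", "mem", "mem", "mem", "mem", "ps"),
  stringsAsFactors = FALSE
)

# Split into pd and non-pd rows
is_pd    <- Denton$Tender == "pd"
df_notpd <- Denton[!is_pd, ]
df_pd    <- Denton[is_pd, ]

# Total pd amount per donor
pd_totals <- aggregate(Gift.Amount ~ MIL.ID, data = df_pd, FUN = sum)

# Keep the first pd row per donor (assumption: other columns match),
# then overwrite its amount with the donor's pd total
pd_first <- df_pd[!duplicated(df_pd$MIL.ID), ]
pd_first$Gift.Amount <- pd_totals$Gift.Amount[match(pd_first$MIL.ID, pd_totals$MIL.ID)]

# Both pieces now have the same columns, so rbind works
combined <- rbind(df_notpd, pd_first)
combined <- combined[order(combined$MIL.ID), ]

summary(combined$Gift.Amount)
```

Because `pd_first` keeps every column of Denton, the two pieces have matching columns when they are bound, which sidesteps the "differing number of rows" error from the question.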
