r基于部分匹配和其他规则组合或求和行



我有一个数据帧df1:

df1 <- data.frame(
  Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
  Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
  Step = c("A","A","B","B","C","C","C"),
  kg = c(31,32,14,16,10,11,10))

有时,在特定的"步骤"中,"批次"会被拆分为a、B或C,如图所示。我想总结一下,得到一个数据帧,告诉我每一步,每一批的总重量。

例如,输出应该如下所示:

df2 <- data.frame(
  Lot = c("13VC011","13VC011","13VC018","13VC018"),
  Step = c("A","B","A","C"),
  kg = c(31,30,32,31))

因此有两个要求。如果"Lot"匹配,而不管后面的字母是什么,并且步长匹配,则出现总和。如果这两个条件都不满足,那么只需将行项目原样结转到df2中。


第2部分:所以我想介绍第三个要求。在某些情况下,批次被分成两个或三个部分,但并非所有数据都存在。在这种情况下,使用这些解决方案掩盖了这一点,并使一批产品的公斤数看起来比实际要低得多。

我想做的是找到一种方法来指示数据集是否包含例如13VC011A,但不包含13VC011B。或者,如果我们看到一个"B",但没有"a",或者一个"C",但有"B"或"a"。

所以现在原始数据帧是:

df1 <- data.frame(
  Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
  Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
  Step = c("A","A","B","B","C","C","C","B"),
  kg = c(31,32,14,16,10,11,10,18))

结果df2应该看起来像:

df2 <- data.frame(
  Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
  Step = c("A","B","A","C","B"),
  kg = c(31,30,32,31,18),
  Partial = c(F,F,F,F,T))
   df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) #replace the character element at the end of string with `""`
   aggregate(kg~Lot+Step,df1, FUN=sum)
   #    Lot Step kg
 #1 13VC011    A 31
 #2 13VC011    B 30
 #3 13VC018    A 32
 #4 13VC018    C 31

或使用dplyr

  library(stringr)
  library(dplyr)
  df1%>%
  group_by(Lot=str_extract(Lot,perl('.*\d(?=[A-Z]?$)')), Step) %>%
  summarize(kg=sum(kg))
   #Source: local data frame [4 x 3]
   #Groups: Lot
   #     Lot Step kg
   #1 13VC011    A 31
   #2 13VC011    B 30
   #3 13VC018    A 32
   #4 13VC018    C 31

解释

regex

.*:选择多个元素

\d:后面跟着一个数字

(?=[A-Z]?$):并查找不在字符串$末尾的字符元素或(?(。

`

> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
       Lot Step kg
1  13VC011    A 31
2  13VC018    A 32
3 13VC011A    B 14
4 13VC011B    B 16
5 13VC018A    C 10
6 13VC018B    C 10
7 13VC018C    C 11

在那一点上,我终于理解了你所说的"不管后面的字母是什么"的意思,并想知道聚合的公式方法是否可以用以下术语之一处理R函数:

> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
  substr(Lot, 1, 7) Step kg
1           13VC011    A 31
2           13VC018    A 32
3           13VC011    B 30
4           13VC018    C 31

最新更新