我有一个数据帧df1:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19"),
Step = c("A","A","B","B","C","C","C"),
kg = c(31,32,14,16,10,11,10))
有时,在特定的"步骤"中,"批次"会被拆分为a、B或C,如图所示。我想总结一下,得到一个数据帧,告诉我每一步,每一批的总重量。
例如,输出应该如下所示:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018"),
Step = c("A","B","A","C"),
kg = c(31,30,32,31))
因此有两个要求。如果"Lot"匹配,而不管后面的字母是什么,并且步长匹配,则出现总和。如果这两个条件都不满足,那么只需将行项目原样结转到df2中。
第2部分:所以我想介绍第三个要求。在某些情况下,批次被分成两个或三个部分,但并非所有数据都存在。在这种情况下,使用这些解决方案掩盖了这一点,并使一批产品的公斤数看起来比实际要低得多。
我想做的是找到一种方法来指示数据集是否包含例如13VC011A,但不包含13VC011B。或者,如果我们看到一个"B",但没有"a",或者一个"C",但有"B"或"a"。
所以现在原始数据帧是:
df1 <- data.frame(
Lot = c("13VC011","13VC018","13VC011A","13VC011B","13VC018A","13VC018C","13VC018B","13VC020B"),
Date = c("2013-07-12","2013-07-11","2013-07-13","2013-07-14","2013-07-16","2013-07-18","2013-07-19","2013-07-22"),
Step = c("A","A","B","B","C","C","C","B"),
kg = c(31,32,14,16,10,11,10,18))
结果df2应该看起来像:
df2 <- data.frame(
Lot = c("13VC011","13VC011","13VC018","13VC018","13VC020B"),
Step = c("A","B","A","C","B"),
kg = c(31,30,32,31,18),
Partial = c(F,F,F,F,T))
df1$Lot <- gsub("[[:alpha:]]$","",df1$Lot) #replace the character element at the end of string with `""`
aggregate(kg~Lot+Step,df1, FUN=sum)
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
或使用dplyr
library(stringr)
library(dplyr)
df1%>%
group_by(Lot=str_extract(Lot,perl('.*\d(?=[A-Z]?$)')), Step) %>%
summarize(kg=sum(kg))
#Source: local data frame [4 x 3]
#Groups: Lot
# Lot Step kg
#1 13VC011 A 31
#2 13VC011 B 30
#3 13VC018 A 32
#4 13VC018 C 31
解释
regex
.*
:选择多个元素
\d
:后面跟着一个数字
(?=[A-Z]?$)
:并查找不在字符串$
末尾的字符元素或(?
(。
`
> aggregate(kg ~Lot + Step, data=df1, FUN=sum)
Lot Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011A B 14
4 13VC011B B 16
5 13VC018A C 10
6 13VC018B C 10
7 13VC018C C 11
在那一点上,我终于理解了你所说的"不管后面的字母是什么"的意思,并想知道聚合的公式方法是否可以用以下术语之一处理R函数:
> aggregate(kg ~substr(Lot,1,7) + Step, data=df1, FUN=sum)
substr(Lot, 1, 7) Step kg
1 13VC011 A 31
2 13VC018 A 32
3 13VC011 B 30
4 13VC018 C 31