我现在正在努力寻找一种方法来匹配R中的两个不同的表(行数不平衡)与日期的匹配范围。
第一个数据帧表示某人声明了什么,第二个数据帧表示实际是什么。问题是,假设A可能在他/她的仓库中保存了一周或更长时间的以公斤为单位的产品,然后在一周后出现。
所以我有两行不同的数据帧。这两个数字都超过500.000行。第一个有50万,第二个有52万。(所以有些NAs可能会发生,但这无关紧要)。
我想从这两个不同的数据帧中搜索和匹配:
- 的id是相同的,
- 产品代码和man代码要一致,而
- 检查第二个数据框中的日期是否在第一个数据框日期的一个月内(例如,他/她在2020/1/1声明重量(100kg),并在2020/1/15显示,但实际重量为105)。
,如果所有这些条件都为真,则将两个权重之和(来自数据帧1和2)
期望的输出必须是一个数据帧,具有匹配的id,日期范围,产品,人,2列权重和最后一列将是权重之和的值。
示例数据帧是下面R中的两个数据帧:
id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200)
data1 = data.frame(id,date,product,man,weight);data1
id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210)
data2 = data.frame(id2,date2,product2,man2,weight2);data2
如何将它们与初始日期延迟条件匹配?
有什么帮助吗?
不确定这是不是你想要的,但它看起来像一个滚动连接…
使用数据。表和润滑包,一个版本可以是:
(这里我展示了两个例子,一个是30天远期,一个是一个月远期)
set.seed(0)
id = rep("A",10)
date = seq(as.Date("2000/1/1"), by = "day", length.out = 10)
product = c("E1","E1","E2","E3","E3","E4","E1","E1","E5","E5")
man = c("PS","PS","PS","DDR","DDR","FFR","PS","PS","BA","BA")
weight = c(1505,300,259,231,140,150,300,112,203,2200)
data1 = data.frame(id,date,product,man,weight);data1
id2 = rep("A",10)
date2 = sample(seq(as.Date("2000/1/10"), by = "day", length.out = 10),10)
product2 = c("E3","E3","E1","E1","E1","E1","E5","E4","E5","E2")
man2 = c("DDR","DDR","PS","PS","PS","PS","BA","FFR","BA","PS")
weight2 = c(259,295,260,240,135,145,250,110,2000,210)
data2 = data.frame(id2,date2,product2,man2,weight2);data2
require(data.table)
require(lubridate)
setDT(data1)
setkey(data1)
setDT(data2)
setkey(data2)
data1[, weight_p50 := weight + 50]
data1[, weight_m50 := weight - 50]
data1[, date := as.Date(date)]
data2[, date2 := as.Date(date2)]
data1[, date_plus_one_month := as.Date(ymd(date %m+% months(1)))]
data1[, date_plus_30_days := as.Date(date + days(30))]
data2[, date2_ := date2]
data1[, date_ := date]
res <-
data1[data2, list(id, product, man, date = date_, date2, weight, weight2, total_weight = weight + weight2),
on = .(id = id2,
product = product2,
man = man2,
date_plus_30_days >= date2,
date <= date2_,
weight_p50 >= weight2,
weight_m50 <= weight2),
allow.cartesian = TRUE, nomatch = 0
][][]
setkey(res, id, product, man, date, date2)
res <- unique(res, by = c("id", "product", "man"))
res[]
给了:
id product man date date2 weight weight2 total_weight
1: A E1 PS 2000-01-02 2000-01-16 300 260 560
2: A E2 PS 2000-01-03 2000-01-17 259 210 469
3: A E3 DDR 2000-01-04 2000-01-18 231 259 490
4: A E4 FFR 2000-01-06 2000-01-19 150 110 260
5: A E5 BA 2000-01-09 2000-01-12 203 250 453
编辑:在OP评论之后,我添加了id x man x product的唯一性,这样它只保留满足约束的第一次约会。