我有一个表,其中包含申请人ID、申请日期和职位描述。
我正在尝试识别重复项,定义为同一申请人ID在其他申请后3天内申请同一职位描述。
我已经在同一天这样做了,这样:
CREATE TABLE Duplicates
SELECT
COUNT (ApplicantID) as ApplicantCount
ApplicantID
ApplicationDate
JobDescription
FROM Applications
GROUP BY ApplicantID,ApplicationDate,JobDescription
-
DELETE FROM Duplicates WHERE ApplicantCount <2
SELECT COUNT(*) FROM Duplicates
我现在正在努力使它不必在ApplicationDate上与完全匹配,而是在一个范围内。你是怎么做到的?
您可以使用lead()
/lag()
。这里有一个例子,当有重复时返回第一个应用程序:
SELECT a.*
FROM (SELECT a.*,
LEAD(ApplicationDate) OVER (PARTITION BY ApplicantID, JobDescription) as next_ad
FROM Applications a
) a
WHERE next_ad <= ApplicationDate + INTERVAL 3 DAY;
您也可以使用exists
:来表达这一点
select a.*
from applications a
where exists (select 1
from applications a2
where a2.ApplicantID = a.ApplicantID and
a2.JobDescription = a.JobDescription and
a2.ApplicationDate > a.ApplicationDate and
a2.ApplicationDate <= a.ApplicationDate + interval 3 day
);