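I have two datasets, df1 and df2. I want to join them and then apply a condition: if the host column has several duplicate rows, keep only one of them (avoid duplicates). I am joining df1 and df2 ON df1.version = df2.name AND df1.date = df2.date
Condition: purpose should = 'hi' or purpose should = 'cat'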
df1
version host date
pat a16 12/1/2019
fam a16 12/1/2019
emp a16 12/1/2019
dan a16 12/1/2019
df2
name purpose date
pat hi 12/1/2019
fam cat 12/1/2019
hello dog 12/1/2019
dan bird 12/1/2019
Here is the join result:
version host date name purpose date
pat a16 12/1/2019 pat hi 12/1/2019
fam a16 12/1/2019 fam cat 12/1/2019
Desired result:
version host date name purpose date
pat a16 12/1/2019 pat hi 12/1/2019
select df1.version, df1.host, df1.date, df2.name, df2.purpose, df2.date
from df1
left join df2
on df1.version = df2.name AND
df1.date = df2.date
where df2.purpose = 'hi' OR df2.purpose = 'cat'
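I think I have to implement an IF THEN statement in SQL. The statement above only performs the join; it does not remove the duplicate host rows. Any suggestions are welcome.
You can apply the ROW_NUMBER() function to number the rows within each host and keep only the first one, which removes the duplicates. Note, however, that when you order only by date, the row that is kept may have purpose hi or cat, because both rows share the same date. If you want one of them to be picked first, you must adjust the ORDER BY accordingly.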
;with cte_df as
(
select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date
,row_number() over(partition by df1.host order by df1.date) as rnk
FROM df1
inner join df2
on df1.version = df2.name and df1.date = df2.date
where df2.purpose in ('hi','cat')
)
SELECT * from cte_df WHERE rnk =1
Without a CTE, using a derived table:
SELECT *
FROM
(
select df1.version, df1.host,df1.date, df2.name, df2.purpose, df2.date as df2date
,row_number() over(partition by df1.host order by df1.date) as rnk
FROM df1
inner join df2
on df1.version = df2.name and df1.date = df2.date
where df2.purpose in ('hi','cat')
) AS T
WHERE rnk = 1
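To make the choice between hi and cat deterministic, one option is to put a CASE expression in the ORDER BY of the window. The sketch below is only an illustration under a few assumptions: it loads the sample rows from the question into an in-memory SQLite database (window functions need SQLite 3.25 or newer), and it assumes that hi should be preferred over cat, which the question does not actually state. The column aliases are added only to keep the outer SELECT unambiguous.

```python
import sqlite3

# In-memory database holding the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE df1 (version TEXT, host TEXT, date TEXT);
CREATE TABLE df2 (name TEXT, purpose TEXT, date TEXT);
INSERT INTO df1 VALUES ('pat', 'a16', '12/1/2019'), ('fam', 'a16', '12/1/2019'),
                       ('emp', 'a16', '12/1/2019'), ('dan', 'a16', '12/1/2019');
INSERT INTO df2 VALUES ('pat', 'hi', '12/1/2019'), ('fam', 'cat', '12/1/2019'),
                       ('hello', 'dog', '12/1/2019'), ('dan', 'bird', '12/1/2019');
""")

# Same derived-table query as above, but the window orders 'hi' before 'cat'
# (assumed preference), so rnk = 1 no longer depends on an arbitrary tie-break.
query = """
SELECT version, host, date, name, purpose, df2date
FROM (
    SELECT df1.version AS version, df1.host AS host, df1.date AS date,
           df2.name AS name, df2.purpose AS purpose, df2.date AS df2date,
           ROW_NUMBER() OVER (
               PARTITION BY df1.host
               ORDER BY CASE df2.purpose WHEN 'hi' THEN 0 ELSE 1 END, df1.date
           ) AS rnk
    FROM df1
    INNER JOIN df2
        ON df1.version = df2.name AND df1.date = df2.date
    WHERE df2.purpose IN ('hi', 'cat')
) AS t
WHERE rnk = 1
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('pat', 'a16', '12/1/2019', 'pat', 'hi', '12/1/2019')]
```

If cat should win instead, swap the values in the CASE expression. The same ORDER BY clause can be dropped into the CTE version unchanged.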