使用SQL标识异常的分组记录



我正在尝试使用sql来识别包含异常的分组订单/记录,而不是只将记录与该异常一起输出。显示所有行,包括有异常的行。我需要识别记录的组,其中该组中只有一个项目=0。

order number   item number   qty
12345           G123          5
12345           C123          4 
12345           I123          0 
12345           K123          6

我正在使用Spark SQL,并尝试使用Group-By子句,该子句的计数为不同的订单号>1

旧的经典SQL,看看它的性能很有趣。

使用我自己的数据,但要证明这一点。优化器可能会重写查询。

val df = Seq( 
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 9, "cpu" ),
( "A", "Bangalore", "a*.com", 7, "desktop" ),
( "C", "Bangalore", "a*.com", 0, "desktop" ),
( "C", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "B", "Bangalore", "a*.com", 5, "desktop" ),
( "B", "Bangalore", "a*.com", 0, "desktop" ),
( "B", "Bangalore", "a*.com", 19, "monitor" ),
).toDF("name" ,"address", "email", "floor", "resource")
df.createOrReplaceTempView("R")
val res = spark.sql(""" 
select *
from R
where R.name IN ( 
select X.name 
from (select name, count(*) 
from R
where floor = 0
group by name
having count(*) = 1 ) X 
)   

""")
res.show(false)

退货:

+----+---------+------+-----+--------+
|name|address  |email |floor|resource|
+----+---------+------+-----+--------+
|B   |Bangalore|a*.com|5    |desktop |
|B   |Bangalore|a*.com|0    |desktop |
|B   |Bangalore|a*.com|19   |monitor |
+----+---------+------+-----+--------+

最新更新