我正在尝试使用sql来识别包含异常的分组订单/记录,而不是只将记录与该异常一起输出。显示所有行,包括有异常的行。我需要识别记录的组,其中该组中只有一个项目=0。
order number item number qty
12345 G123 5
12345 C123 4
12345 I123 0
12345 K123 6
我正在使用Spark SQL,并尝试使用Group-By子句,该子句的计数为不同的订单号>1
旧的经典SQL,看看它的性能很有趣。
使用我自己的数据,但要证明这一点。优化器可能会重写查询。
val df = Seq(
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 9, "cpu" ),
( "A", "Bangalore", "a*.com", 7, "desktop" ),
( "C", "Bangalore", "a*.com", 0, "desktop" ),
( "C", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "D", "Bangalore", "a*.com", 0, "desktop" ),
( "B", "Bangalore", "a*.com", 5, "desktop" ),
( "B", "Bangalore", "a*.com", 0, "desktop" ),
( "B", "Bangalore", "a*.com", 19, "monitor" ),
).toDF("name" ,"address", "email", "floor", "resource")
df.createOrReplaceTempView("R")
val res = spark.sql("""
select *
from R
where R.name IN (
select X.name
from (select name, count(*)
from R
where floor = 0
group by name
having count(*) = 1 ) X
)
""")
res.show(false)
退货:
+----+---------+------+-----+--------+
|name|address |email |floor|resource|
+----+---------+------+-----+--------+
|B |Bangalore|a*.com|5 |desktop |
|B |Bangalore|a*.com|0 |desktop |
|B |Bangalore|a*.com|19 |monitor |
+----+---------+------+-----+--------+