I have a list l that looks like l = ["3", "on", "red"], and a Spark DataFrame df that looks like this:
cel1 cel2 cel3 cel4 cel5 cel6 rule1 rule2 rule3
2    3    20   5    67   2    1     off   green
12   3    21   3    60   21   3     on    red
32   6    23   2    67   1    9     on    green
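How can I select the rows of the Spark DataFrame whose values in the last three columns match the values in the list l? In this example, only the second row should be returned.
Thanks in advance!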
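You can iterate over the list to build a list of conditions, then reduce them to a single condition with the bitwise AND operator (&). Make sure the column types match the types of the values in l; for example, column rule1 should be of string type, not int.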
import pyspark.sql.functions as F
from functools import reduce

l = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]

# Build one equality condition per (column, value) pair and AND them together.
df2 = df.filter(
    reduce(
        lambda a, b: a & b,
        [F.col(x) == y for (x, y) in zip(cols, l)]
    )
)
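To see what the reduce call folds together, here is a plain-Python sketch of the same logic on a list of dicts standing in for the DataFrame rows. It does not use Spark: the `rows` data and the `matches` helper are assumptions made up for illustration, and Spark's own comparison and type-coercion rules may differ from Python's.

```python
from functools import reduce

l = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]

# Hypothetical stand-in for the rule columns of df (cel1 kept only to identify rows).
rows = [
    {"cel1": 2, "rule1": "1", "rule2": "off", "rule3": "green"},
    {"cel1": 12, "rule1": "3", "rule2": "on", "rule3": "red"},
    {"cel1": 32, "rule1": "9", "rule2": "on", "rule3": "green"},
]


def matches(row):
    # One boolean per (column, value) pair, folded with & like the Column version.
    conditions = [row[x] == y for (x, y) in zip(cols, l)]
    return reduce(lambda a, b: a & b, conditions)


matched = [row["cel1"] for row in rows if matches(row)]
print(matched)  # [12]

# In plain Python the int 3 never equals the string "3", so this row is rejected.
print(matches({"cel1": 12, "rule1": 3, "rule2": "on", "rule3": "red"}))  # False
```

Only the row with cel1 = 12 satisfies all three conditions, which corresponds to the second row of the example DataFrame.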
Another solution, using Spark SQL and the concat_ws() function:
df = spark.sql(""" with t1 (
    select 2 c1, 3 c2, 20 c3, 5 c4, 67 c5, 2 c6, 1 c7, 'off' c8, 'green' c9 union all
    select 12 c1, 3 c2, 21 c3, 3 c4, 60 c5, 21 c6, 3 c7, 'on' c8, 'red' c9 union all
    select 32 c1, 6 c2, 23 c3, 2 c4, 67 c5, 1 c6, 9 c7, 'on' c8, 'green' c9
) select c1 cel1, c2 cel2, c3 cel3, c4 cel4, c5 cel5, c6 cel6, c7 rule1, c8 rule2, c9 rule3 from t1
""")
df.show()
df.createOrReplaceTempView("df")
+----+----+----+----+----+----+-----+-----+-----+
|cel1|cel2|cel3|cel4|cel5|cel6|rule1|rule2|rule3|
+----+----+----+----+----+----+-----+-----+-----+
|   2|   3|  20|   5|  67|   2|    1|  off|green|
|  12|   3|  21|   3|  60|  21|    3|   on|  red|
|  32|   6|  23|   2|  67|   1|    9|   on|green|
+----+----+----+----+----+----+-----+-----+-----+
inp = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]
inp_str = ",".join(inp)
cols_str = ",".join(cols)
spark.sql(f"""select * from df where concat_ws(",",{cols_str})='{inp_str}' """).show()
+----+----+----+----+----+----+-----+-----+-----+
|cel1|cel2|cel3|cel4|cel5|cel6|rule1|rule2|rule3|
+----+----+----+----+----+----+-----+-----+-----+
|  12|   3|  21|   3|  60|  21|    3|   on|  red|
+----+----+----+----+----+----+-----+-----+-----+
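One caveat with comparing a concatenated string: different value combinations can collapse to the same joined text when a value itself contains the separator, and concat_ws skips NULL values. A possible alternative is to compare column by column in the WHERE clause. The sketch below builds such a predicate as a string; `build_predicate` is a hypothetical helper, not part of Spark, and it assumes plain column names and string values without quotes or backslashes.

```python
def build_predicate(cols, values):
    """Return a SQL predicate such as: rule1 = '3' AND rule2 = 'on'."""
    if len(cols) != len(values):
        raise ValueError("cols and values must have the same length")
    parts = []
    for col, val in zip(cols, values):
        # Refuse values that would need escaping rather than guess the escape rules.
        if "'" in val or "\\" in val:
            raise ValueError(f"unsupported character in value: {val!r}")
        parts.append(f"{col} = '{val}'")
    return " AND ".join(parts)


inp = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]

# Joined strings can collide when a value contains the separator.
print(",".join(["a,b", "c"]) == ",".join(["a", "b,c"]))  # True

predicate = build_predicate(cols, inp)
print(predicate)  # rule1 = '3' AND rule2 = 'on' AND rule3 = 'red'
```

The resulting string could then be used as spark.sql(f"select * from df where {predicate}") or passed to df.filter(predicate), since filter also accepts a SQL expression string.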