How to select a row of a Spark DataFrame based on values in a list



I have a list l that looks like l = ["3", "on", "red"], and a Spark DataFrame df that looks like

cel1 cel2 cel3 cel4 cel5 cel6 rule1 rule2 rule3
2    3    20   5    67   2    1     off   green
12   3    21   3    60   21   3     on    red
32   6    23   2    67   1    9     on    green

How can I select the rows of the Spark DataFrame whose last 3 columns match the values in the list l? For example, in this case only the second row should be returned.

Thanks in advance!

You can iterate over the list to generate a list of conditions, and reduce them into a single condition with the bitwise AND operator. Make sure the column types match the types of the values in the list l; for example, column rule1 should be of string type, not int.

import pyspark.sql.functions as F
from functools import reduce
l = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]
df2 = df.filter(
    reduce(
        lambda a, b: a & b,
        [F.col(x) == y for (x, y) in zip(cols, l)]
    )
)
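To see what the reduce call does here: it folds the per-column equality conditions into one combined condition with &. The same folding pattern can be illustrated with plain Python booleans standing in for Spark Column expressions (a sketch of the pattern only; the dictionary row below is a hypothetical single row, not Spark API):

```python
from functools import reduce

# Hypothetical stand-in for one DataFrame row (the matching second row).
row = {"rule1": "3", "rule2": "on", "rule3": "red"}
l = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]

# One boolean per (column, expected value) pair, like [F.col(x) == y ...].
conditions = [row[c] == v for c, v in zip(cols, l)]

# Fold the list into a single condition with bitwise AND, like in filter().
matches = reduce(lambda a, b: a & b, conditions)
print(matches)  # True: every rule column matches the list
```

In the Spark version, each element of the list is a Column expression rather than a bool, but & combines them the same way, producing one Column that filter() can evaluate per row.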

Another solution, using Spark SQL and the concat_ws() function.

df = spark.sql(""" with t1 as (
select  2 c1, 3 c2, 20 c3, 5 c4, 67 c5,  2 c6, 1 c7, 'off' c8, 'green' c9 union all
select 12 c1, 3 c2, 21 c3, 3 c4, 60 c5, 21 c6, 3 c7, 'on'  c8, 'red'   c9 union all
select 32 c1, 6 c2, 23 c3, 2 c4, 67 c5,  1 c6, 9 c7, 'on'  c8, 'green' c9
) select c1 cel1, c2 cel2, c3 cel3, c4 cel4, c5 cel5, c6 cel6, c7 rule1, c8 rule2, c9 rule3 from t1
""")
df.show()
df.createOrReplaceTempView("df")
+----+----+----+----+----+----+-----+-----+-----+
|cel1|cel2|cel3|cel4|cel5|cel6|rule1|rule2|rule3|
+----+----+----+----+----+----+-----+-----+-----+
|   2|   3|  20|   5|  67|   2|    1|  off|green|
|  12|   3|  21|   3|  60|  21|    3|   on|  red|
|  32|   6|  23|   2|  67|   1|    9|   on|green|
+----+----+----+----+----+----+-----+-----+-----+

inp = ["3", "on", "red"]
cols = ["rule1", "rule2", "rule3"]
inp_str = ",".join(inp)
cols_str = ",".join(cols)
spark.sql(f"""select * from df where concat_ws(",", {cols_str}) = '{inp_str}' """).show()
+----+----+----+----+----+----+-----+-----+-----+
|cel1|cel2|cel3|cel4|cel5|cel6|rule1|rule2|rule3|
+----+----+----+----+----+----+-----+-----+-----+
|  12|   3|  21|   3|  60|  21|    3|   on|  red|
+----+----+----+----+----+----+-----+-----+-----+
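One caveat with the concat_ws approach (an observation not in the original answer): comparing joined strings can produce false matches when the values themselves contain the separator character. A minimal plain-Python illustration:

```python
# Two different value lists that collapse to the same joined string
# when "," is both the separator and a character inside the data.
a = ["3,on", "red"]
b = ["3", "on,red"]

joined_a = ",".join(a)
joined_b = ",".join(b)
print(joined_a == joined_b)  # True: both are "3,on,red"
```

If the rule columns can ever contain commas, pick a separator that cannot appear in the data, or prefer the column-by-column filter from the first answer, which has no such ambiguity.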
