pyspark: AnalysisException from withColumn after converting a column to lowercase



I have a dataframe that looks like the following

+------------+------+
|        food|pounds|
+------------+------+
|       bacon|   4.0|
|STRAWBERRIES|   3.5|
|       Bacon|   7.0|
|STRAWBERRIES|   3.0|
|       BACON|   6.0|
|strawberries|   9.0|
|Strawberries|   1.0|
|      pecans|   3.0|
+------------+------+

The expected output is

+------------+------+---------+
|        food|pounds|food_type|
+------------+------+---------+
|       bacon|   4.0|     meat|
|STRAWBERRIES|   3.5|    fruit|
|       Bacon|   7.0|     meat|
|STRAWBERRIES|   3.0|    fruit|
|       BACON|   6.0|     meat|
|strawberries|   9.0|    fruit|
|Strawberries|   1.0|    fruit|
|      pecans|   3.0|    other|
+------------+------+---------+

So I basically defined a new_column based on my logic and applied it with .withColumn

from pyspark.sql.functions import when, col, lower

new_column = when((col('food') == 'bacon') | (col('food') == 'BACON') | (col('food') == 'Bacon'), 'meat'
).when((col('food') == 'STRAWBERRIES') | (col('food') == 'strawberries') | (col('food') == 'Strawberries'), 'fruit'
).otherwise('other')

and then

df.withColumn("food_type", new_column).show()

This works fine. But I wanted to shorten the new_column statement, so I rewrote it as follows

new_column = when(lower(col('food') == 'bacon') , 'meat'
).when(lower(col('food') == 'strawberries'), 'fruit'
).otherwise('other')

Now when I run df.withColumn("food_type", new_column).show()

I get the error

AnalysisException: "cannot resolve 'CASE WHEN lower(CAST((`food` = 'bacon') AS STRING)) THEN 'meat' WHEN lower(CAST((`food` = 'strawberries') AS STRING)) THEN 'fruit' ELSE 'other' END' due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type, but the 1th when expression's type is lower(cast((food#165 = bacon) as string));;
'Project [food#165, pounds#166, CASE WHEN lower(cast((food#165 = bacon) as string)) THEN meat WHEN lower(cast((food#165 = strawberries) as string)) THEN fruit ELSE other END AS food_type#197]
+- Relation[food#165,pounds#166] csv
"

What am I missing?

The parentheses are misplaced. You are applying lower() to the boolean comparison (col('food') == 'bacon') rather than to the column itself, so each WHEN condition evaluates to a string instead of the boolean that CaseWhen requires. Move the comparison outside of lower():

new_column = when(lower(col('food')) == 'bacon' , 'meat').when(lower(col('food')) == 'strawberries', 'fruit').otherwise('other')

I'd like to share another approach, which reads more like a SQL query and also scales better to more complex, nested conditions.

from pyspark.sql.functions import expr
cond = """case when lower(food) in ('bacon') then 'meat'
else case when lower(food) in ('strawberries') then 'fruit'
else 'other'
end
end"""
newdf = df.withColumn("food_type", expr(cond))

Hope this helps.

Regards,

Neeraj

Simplified:

new_column = when(lower(col("food")) == "bacon", "meat").when(lower(col("food")) == "strawberries", "fruit").otherwise("other")

df.withColumn("food_type", new_column).show()
