I'm new to PySpark and I have a question.
I have a df like this:
+---+---+-------+
| a1| a2|formula|
+---+---+-------+
| 18| 12| a1+a2|
| 11| 1| a1-a2|
+---+---+-------+
I'm trying to parse the "formula" column to create a new column holding the resolved formula, ending up with a df like this:
+---+---+-------+----------------+
| a1| a2|formula|resolved_formula|
+---+---+-------+----------------+
| 18| 12| a1+a2| 30|
| 11| 1| a1-a2| 10|
+---+---+-------+----------------+
I tried
df2 = df.withColumn('resolved_formula', f.expr(df.formula))
df2.show()
but I get this error:
TypeError: Column is not iterable
Can anyone help me?
Thanks a lot!!
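The TypeError comes from f.expr() expecting a SQL expression as a string, not a Column, so it cannot pick up a different formula for each row. A minimal sketch (rebuilding the same df) to show the difference:

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(18, 12, 'a1+a2'), (11, 1, 'a1-a2')],
                           ['a1', 'a2', 'formula'])

# expr() works when the formula is a plain string known up front...
df.withColumn('resolved_formula', f.expr('a1 + a2')).show()

# ...but f.expr(df.formula) passes a Column object instead of a string and
# raises "TypeError: Column is not iterable"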
Here's a convoluted way to do what you're trying to do.
from pyspark.sql import functions as func

# start the new column as a copy of the formula string
data_sdf = data_sdf. \
    withColumn('new_formula', func.col('formula'))

# this can also be done in a single regex (see the sketch after the output below)
# prefix a row variable ("r.") to every column name so the formula can be
# evaluated inside a lambda
for column in data_sdf.columns:
    if column != 'formula':
        data_sdf = data_sdf. \
            withColumn('new_formula', func.regexp_replace('new_formula', column, 'r.' + column))

# use `eval()` to evaluate the operation row by row
data_sdf. \
    rdd. \
    map(lambda r: (r.a1, r.a2, r.formula, eval(r.new_formula))). \
    toDF(['a1', 'a2', 'formula', 'resolved_formula']). \
    show()
# +---+---+-------+----------------+
# | a1| a2|formula|resolved_formula|
# +---+---+-------+----------------+
# | 18| 12| a1+a2| 30|
# | 11| 1| a1-a2| 10|
# +---+---+-------+----------------+
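As the comment in the code notes, the per-column loop can be collapsed into a single regexp_replace with a capture group (Spark uses Java regex, so $1 is the backreference). A sketch of that variant, assuming an active SparkSession named spark and the same sample data:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()
data_sdf = spark.createDataFrame([(18, 12, 'a1+a2'), (11, 1, 'a1-a2')],
                                 ['a1', 'a2', 'formula'])

# prefix every identifier in the formula with "r." in one pass
data_sdf = data_sdf. \
    withColumn('new_formula', func.regexp_replace('formula', r'([a-z][a-z0-9_]*)', r'r.$1'))

# evaluate each rewritten formula against its own row
data_sdf. \
    rdd. \
    map(lambda r: (r.a1, r.a2, r.formula, eval(r.new_formula))). \
    toDF(['a1', 'a2', 'formula', 'resolved_formula']). \
    show()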