I have a Spark DataFrame named "df_array" that will always return a single array as output, like this:
arr_value
[M,J,K]
I want to extract its value and add it as a column to another DataFrame. Below is the code I am running:
val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))
But my code always fails with "org.apache.spark.sql.AnalysisException: Resolved attribute(s) ... missing".
Can anyone help me?
The operation you need here is a join.
You need a common column in both DataFrames that will act as the "key".
After the join, you can select
which columns to include in the new DataFrame.
More details can be found here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.
The following performs a full outer join between df1 and df2.
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
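Since the question is in Scala, the same idea can be sketched there. This assumes both frames share a key column (here a hypothetical `id`; the question's frames do not show one, so you would need to add it):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-example").getOrCreate()
import spark.implicits._

// Hypothetical frames that share an "id" key column
val old_df   = Seq((1, "a"), (2, "b")).toDF("id", "old_col")
val df_array = Seq((1, Seq("M", "J", "K")), (2, Seq("M", "J", "K"))).toDF("id", "arr_value")

// Equi-join on the common key, then select the columns for the new frame
val new_df = old_df.join(df_array, Seq("id"), "inner")
  .select($"id", $"old_col", $"arr_value")
new_df.show()
```

If `df_array` has no natural key to join on, the approaches in the other answer (collecting the single row, or a cross join) are more appropriate.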
If you know that df_array
has only one record, you can collect it to the driver with first()
and then use it as an array of literal values to create a column in any DataFrame:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// first - collect that single array to the driver (assuming an array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)
// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))
new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// | 1| a| [M, J, K]|
// | 2| b| [M, J, K]|
// +--------+--------+---------------+
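As an alternative that avoids collecting to the driver, since df_array has exactly one row, a cross join attaches its array column to every row of old_df (a sketch; `arr_value` is the column name from the question, and `crossJoin` is available from Spark 2.1 onward):

```scala
import org.apache.spark.sql.functions.col

// Cross join: every row of old_df is paired with the single row of df_array,
// so the array column is attached without a round-trip to the driver.
val new_df = old_df.crossJoin(
  df_array.select(col("arr_value").as("new_array_value")))
new_df.show()
```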