In Spark, I have the following DataFrame named "df", which contains some null entries:
+-------+--------------------+--------------------+
| id| features1| features2|
+-------+--------------------+--------------------+
| 185|(5,[0,1,4],[0.1,0...| null|
| 220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
| 225| null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+
df.features1 and df.features2 are of type Vector (nullable). I then tried to fill the null entries with sparse vectors using the following code:
df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})
This code raised the following error:
AttributeError: 'SparseVector' object has no attribute '_get_object_id'
I then found the following passage in the Spark documentation:
fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
Does this explain why I failed to replace the null entries in the DataFrame with sparse vectors? Or does it mean there is no way to do this in a DataFrame at all?
I can achieve my goal by converting the DataFrame to an RDD and replacing the None values with SparseVectors, but doing this directly in the DataFrame would be much more convenient.
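For reference, this is roughly the RDD detour I mean (a minimal sketch assuming the schema shown above; fill_nulls is just an illustrative helper, not a Spark API):
from pyspark.ml.linalg import SparseVector
def fill_nulls(row):
    # Keep the vector if present; otherwise substitute an empty SparseVector of the right size
    f1 = row.features1 if row.features1 is not None else SparseVector(5, {})
    f2 = row.features2 if row.features2 is not None else SparseVector(10, {})
    return (row.id, f1, f2)
df1 = df.rdd.map(fill_nulls).toDF(["id", "features1", "features2"])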
Is there any way to do this directly in a DataFrame? Thanks!
Yes, that passage explains it: fillna() / na.fill() only accepts int, long, float, or string replacement values (or a dict of those), so it cannot fill a vector column. You can instead use a udf:
from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT
# Pass the vector through unchanged; replace null with an empty SparseVector of size i
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)
df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])
(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())
# +-------------+---------------+
# | features1| features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# | (5,[],[])| (10,[],[])|
# +-------------+---------------+
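If you have several vector columns, the same udf can be applied in a loop; sizes below is a hypothetical mapping from column name to vector size, mirroring the dict you originally passed to na.fill:
sizes = {"features1": 5, "features2": 10}  # assumed mapping, adjust to your schema
for name, size in sizes.items():
    df = df.withColumn(name, fill_with_vector(name, lit(size)))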