pyspark - fold and sum with an ArrayType column

I'm trying to do an element-wise sum, and I've created this dummy df. The output should be [10, 4, 4, 1].

from pyspark.sql.types import StructType,StructField, StringType, IntegerType, ArrayType
data = [
    ("James", [1, 1, 1, 1]),
    ("James", [2, 1, 1, 0]),
    ("James", [3, 1, 1, 0]),
    ("James", [4, 1, 1, 0])
]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

df = spark.createDataFrame(data=data,schema=schema)

posexplode works, but my real df is too large, so I'm trying to use fold instead, and it gives me an error. Any ideas? Thanks.

vec_df = df.select("scores")
vec_sums = vec_df.rdd.fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])

File "<ipython-input-115-9b470dedcfef>", line 2, in <listcomp>
TypeError: unsupported operand type(s) for +: 'int' and 'list'

You need to map the RDD of Rows to an RDD of lists before calling fold:

vec_sums = vec_df.rdd.map(lambda x: x[0]).fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
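As a rough illustration of what fold does with the mapped lists, here is a plain-Python simulation that runs without Spark. The helper `simulated_fold` and the two-partition split are hypothetical, chosen only for this sketch; the real partitioning depends on the cluster. It assumes fold applies the zero value once per partition and once more when merging the partial results, which is why the zero value has to be neutral for the operation.

```python
from functools import reduce


def add_elementwise(a, b):
    # Same lambda as in the answer: sum two lists position by position.
    return [x + y for x, y in zip(a, b)]


def simulated_fold(partitions, zero, op):
    # Hypothetical stand-in for RDD.fold: fold each partition starting
    # from `zero`, then fold the partial results starting from `zero` again.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Assumed split of the four score lists into two partitions.
partitions = [
    [[1, 1, 1, 1], [2, 1, 1, 0]],
    [[3, 1, 1, 0], [4, 1, 1, 0]],
]

vec_sums = simulated_fold(partitions, [0] * 4, add_elementwise)
print(vec_sums)  # [10, 4, 4, 1]

# A non-neutral zero value is added once per partition plus once at the merge.
skewed = simulated_fold(partitions, [1] * 4, add_elementwise)
print(skewed)  # [13, 7, 7, 4]
```

With `[0] * 4` the extra applications of the zero value are harmless, so the result does not depend on how the rows are partitioned.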

To help understand this, you can look at what the RDD contains.

>>> vec_df.rdd.collect()
[Row(scores=[1, 1, 1, 1]), Row(scores=[2, 1, 1, 0]), Row(scores=[3, 1, 1, 0]), Row(scores=[4, 1, 1, 0])]
>>> vec_df.rdd.map(lambda x: x[0]).collect()
[[1, 1, 1, 1], [2, 1, 1, 0], [3, 1, 1, 0], [4, 1, 1, 0]]

So you can think of vec_df.rdd as containing a nested list: each Row wraps the scores list, and it needs to be unnested before the fold.
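The error can be reproduced without Spark, which may make the nesting easier to see. In this minimal sketch, a `namedtuple` called `Row` is a hypothetical stand-in for `pyspark.sql.Row` (which is also a tuple subclass), and `functools.reduce` stands in for fold on a single partition. Zipping the zero list with a one-field Row pairs the integer 0 with the whole inner list, hence the `int` plus `list` failure.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical stand-in for pyspark.sql.Row: a tuple with a single field.
Row = namedtuple("Row", ["scores"])

rows = [
    Row([1, 1, 1, 1]),
    Row([2, 1, 1, 0]),
    Row([3, 1, 1, 0]),
    Row([4, 1, 1, 0]),
]


def add_elementwise(a, b):
    # Same lambda as in the question: sum two sequences position by position.
    return [x + y for x, y in zip(a, b)]


# Folding the Rows directly: zip([0, 0, 0, 0], row) yields (0, [1, 1, 1, 1]),
# so the comprehension tries to compute 0 + [1, 1, 1, 1].
try:
    reduce(add_elementwise, rows, [0] * 4)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Unwrapping each Row first (the equivalent of .map(lambda x: x[0])) works.
unwrapped = [row[0] for row in rows]
vec_sums = reduce(add_elementwise, unwrapped, [0] * 4)
print(vec_sums)  # [10, 4, 4, 1]
```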
