Map an array and keep the original format



I am working with a Spark SQL DataFrame.

df = sql.read.parquet("toy_data")
df.show()
+-----------+----------+
|          x|         y|
+-----------+----------+
| -4.5707927| -5.282721|
|  -5.762503| -4.832158|
|   7.907721|  6.793022|
|  7.4408655| -6.601918|
| -4.2428184| -4.162871|

I have a list of tuples, each structured like this:

(Row(x=-8.45811653137207, y=-5.179722309112549), ((-1819.74814533043, 47.745243303477764), 333))

where the first element is a point and the second element is a (sum_of_points, number_of_points) tuple.

When I divide sum_of_points by number_of_points, like this:

new_centers = center_sum_num.map(lambda tup: np.asarray(tup[1][0])/tup[1][1]).collect()

I get the following, which is a list of numpy arrays:

[array([-0.10006594, -6.7719144 ]), array([-0.25844196,  5.28381418]), array([-5.12591623, -4.5685448 ]), array([ 5.40192709, -4.35950824])]

However, I want to keep the points in their original format, like this:

[Row(x=-5.659833908081055, y=7.705344200134277), Row(x=3.17942214012146, y=-9.446121215820312), Row(x=9.128270149230957, y=4.5666022300720215), Row(x=-6.432034969329834, y=-4.432190895080566)]

In other words, I don't want a list of numpy arrays; I want a list of Row(x=…, y=…) objects.

How can I do that?

Here is my full code for reference:

new_centers = [Row(x=-5.659833908081055, y=7.705344200134277), Row(x=3.17942214012146, y=-9.446121215820312), Row(x=9.128270149230957, y=4.5666022300720215), Row(x=-6.432034969329834, y=-4.432190895080566)]


while old_centers is None or not has_converged(old_centers, new_centers, epsilon) and iteration < max_iterations:
    # update centers
    old_centers = new_centers

    center_pt_1 = points.rdd.map(lambda point: ( old_centers[nearest_center(old_centers, point)[0]], (point, 1) ) )
    # note that nearest_center()[0] is the index
    center_sum_num =center_pt_1.reduceByKey(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1]) ,a[1] + b[1]))

    new_centers = center_sum_num.map(lambda tup: np.asarray(tup[1][0])/tup[1][1]).collect()


    iteration += 1
return new_centers

Define the structure:

import numpy as np
from pyspark.sql import Row

row = Row("x", "y")

and unpack the result into it:

x = (
    Row(x=-8.45811653137207, y=-5.179722309112549),  
    ((-1819.748514533043, 47.745243303477764), 333)
)
f = lambda tup: row(*np.asarray(tup[1][0]) / tup[1][1])
f(x)
## Row(x=-5.4647102538529815, y=0.14337910901945275)
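
As a rough sketch of the same idea without numpy: the division can be done with plain Python floats, so the Row fields are built-in float values rather than numpy.float64, which some Spark operations may not accept. The name to_row and the namedtuple fallback are assumptions added here for illustration. The fallback only exists so the sketch runs where Spark is not installed.

```python
# Sketch: turn (center, ((sum_x, sum_y), count)) tuples into Row-like means.
try:
    from pyspark.sql import Row
    row = Row("x", "y")  # same factory as in the answer
except ImportError:
    # Assumption: stand-in so the sketch also runs without Spark installed.
    from collections import namedtuple
    Row = namedtuple("Row", ["x", "y"])
    row = Row


def to_row(tup):
    """Return the mean point as row(x, y) with built-in float fields."""
    sums, count = tup[1]  # tup[0] is the old center and is ignored here
    return row(*(float(s) / count for s in sums))


sample = (
    row(-8.45811653137207, -5.179722309112549),
    ((-1819.748514533043, 47.745243303477764), 333),
)
mean = to_row(sample)
print(mean)
```

In the loop from the question, the function could presumably be passed straight to the map, as in center_sum_num.map(to_row).collect(), so that new_centers stays a list of Row objects.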
