pyspark.pandas API: how to split a column containing lists into multiple columns



In my Databricks notebook, I am trying to split a column whose values are lists such as [599086.9706961295, 4503107.843920314] into two columns ("x" and "y").

In my Jupyter notebook, the column is split like this:

# code from my Jupyter notebook
# column with list in it is: xy
# Method 1
complete[['x', 'y']] = pd.Series(np.stack(complete['xy'].values).T.tolist())
# column is also getting separated using this method
# Method 2
def sepXY(xy):
    return xy[0], xy[1]
complete['x'],complete['y'] = zip(*complete['xy'].apply(sepXY))

In my Databricks notebook, I get errors.

I tried both methods:

import pyspark.pandas as ps
# Method 1
complete[['x', 'y']] = ps.Series(np.stack(complete['xy'].values).T.tolist())

AssertionError:

If I run only ps.Series(np.stack(complete['xy'].values).T.tolist()), the output is two lists, one of x values and one of y values:

0    [599086.9706961295, 599079.1456765212, 599059....
1    [4503107.843920314, 4503083.465809557, 4503024...

But when I assign it to complete[['x','y']], it raises the error.

# Method 2
def sepXY(xy):
    return xy[0], xy[1]
complete['x'],complete['y'] = zip(*complete['xy'].apply(sepXY))

ArrowInvalid: Could not convert (599086.9706961295, 4503107.843920314) with type tuple: did not recognize Python value type when inferring an Arrow data type

I checked the data type of the column, and it is not a tuple.

I also tried:

complete[['x','y']] = pd.DataFrame(complete.xy.tolist(), index= complete.index)

If I use this, my kernel restarts.

# This is the column for sample
xy
[599086.9706961295, 4503107.843920314]
[599088.5389507986, 4503112.7796745915]
[599072.8088083105, 4503064.139248001]
[599090.0996424126, 4503117.721156018]
[599074.3909188313, 4503068.925677084]

Input:

complete = spark.createDataFrame(
[([599086.9706961295, 4503107.843920314],),
([599088.5389507986, 4503112.7796745915],),
([599072.8088083105, 4503064.139248001],),
([599090.0996424126, 4503117.721156018],),
([599074.3909188313, 4503068.925677084],)],
['xy']
).pandas_api()

For the example above, this works:

complete['x'] = complete['xy'].apply(lambda x: x[0])
complete['y'] = complete['xy'].apply(lambda x: x[1])
print(complete)
#                                         xy              x             y
# 0   [599086.9706961295, 4503107.843920314]  599086.970696  4.503108e+06
# 1  [599088.5389507986, 4503112.7796745915]  599088.538951  4.503113e+06
# 2   [599072.8088083105, 4503064.139248001]  599072.808808  4.503064e+06
# 3   [599090.0996424126, 4503117.721156018]  599090.099642  4.503118e+06
# 4   [599074.3909188313, 4503068.925677084]  599074.390919  4.503069e+06
print(complete.dtypes)
# xy     object
# x     float64
# y     float64
# dtype: object