Suppose we have this RDD:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
Since the RDD has two columns, I would like to get two RDDs like this:
RDDList[0] = (["panda"], ["pink"])
RDDList[1] = ([0], [3])
I couldn't find any earlier discussion of this topic. Is this even possible?
You can do the following:
RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
cols = [0, 1]
RDDList = [(RDDs.map(lambda x: [x[col]]).collect()) for col in cols]
This should give you:
print RDDList[0]
#[['panda'], ['pink']]
print RDDList[1]
#[[0], [3]]
I hope this answer helps.
This builds on @Ramesh Maharjan's answer so that it works for any RDD (Python 3.x):
RDDList = []
for i in range(len(RDDs.first())):
    RDDList.append(RDDs.map(lambda x: [x[i]]).collect())
print (RDDList[0])
print (RDDList[1])
Expected output:
[['panda'], ['pink']]
[[0], [3]]
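Note that both snippets call collect(), so each entry of RDDList is a plain Python list on the driver, not an RDD. If the columns should stay as RDDs, one option is to drop collect() and keep the mapped RDDs. Below is a minimal sketch, assuming every row is indexable and has the same length. split_columns is a hypothetical helper name, not part of PySpark. The c=c default argument binds the column index when each lambda is created. As far as I know this matters here because map is lazy, so a plain closure over the loop variable could see only its last value by the time an action runs.

```python
def split_columns(rdd, cols):
    """Return one mapped RDD per column index, without collecting.

    Hypothetical helper: `rdd` is anything with a lazy .map(), such as a
    PySpark RDD; `cols` is an iterable of column indices.
    """
    # c=c freezes the current index inside each lambda.
    return [rdd.map(lambda row, c=c: [row[c]]) for c in cols]


# Usage with the RDD from the question (requires a running SparkContext `sc`):
# RDDs = sc.parallelize([["panda", 0], ["pink", 3]])
# RDDList = split_columns(RDDs, range(len(RDDs.first())))
# RDDList[0].collect()
# RDDList[1].collect()
```

Each element of the returned list is still an RDD, so further transformations can be chained on it before any data is pulled back to the driver.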