PySpark基于第二个值作为键将字符串组合成一对

我是pyspark的新手，我仍在努力了解如何映射和减少工作。

我有一个数据集作为RDD读取，附在learn.txt之后。

根据第二个值(数字一个)，我想看看哪两个字母有相同的值，有多少次。

我当前的代码输出:

[(('b', 'c'), 1),
('c', 1),
('d', 1),
('a', 1),
(('a', 'b'), 2),
((('a', 'b'), 'c'), 1)]

我想要的输出:

[(('b','a'),3),
(('a','b'),3),
(('b','c'),2),
(('c','b'),2),
(('a','c'),1),
(('c','a'),1)]

在所有的排列中，如果它们有一个匹配，则只对。

我不相信我的代码会有太大的帮助，但这是我得到的:

from pyspark import RDD, SparkContext
from pyspark.sql import DataFrame, SparkSession
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
df = sc.textFile("learn.txt")
mapped = df.map(lambda x: [a for a in x.split(',')])
remapped = mapped.map(lambda x: (x[1], x[0]))
reduced = remapped.reduceByKey(lambda x,y: (x,y))
threemapped = reduced.map(lambda x: (x[1], 1))
output = threemapped.reduceByKey(lambda x, y: x+y)
output.collect()

其中learn.txt:

a,1
a,2
a,3
a,4
b,2
b,3
b,4
b,6
c,2
c,5
c,6
d,7

使用.reduceByKey(lambda x,y: (x,y))，您可以创建元组的元组的交织元组…你不能用它做任何事情。

因为你正在寻找一对共享一个键的值，我们可以使用这样的连接:

# same code as you
vals = df
.map(lambda x: [a for a in x.split(',')])
.map(lambda x: (x[1], x[0]))
# but then you can join vals with itself and use reduceByKey to count occurrences
result = vals.join(vals)
.filter(lambda x: x[1][0] != x[1][1])
.map(lambda x: ((x[1][1], x[1][0]), 1))
.reduceByKey(lambda a, b: a+b)
.collect()

收益率:

[(('b', 'a'), 3), (('c', 'a'), 1), (('c', 'b'), 2),
(('b', 'c'), 2), (('a', 'b'), 3), (('a', 'c'), 1)]

相关内容

最新更新

热门标签：