python 2.7 - How to use map() to convert (key, value) pairs to just the values in PySpark



I have this code in PySpark:

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
wordPairs = wordsRDD.map(lambda w: (w, 1))  # pair each word with a count of 1

wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)
print wordCounts.collect()
# PRINTS --> [('rat', 2), ('elephant', 1), ('cat', 2)]

from operator import add
totalCount = (wordCounts
              .map(<< FILL IN >>)
              .reduce(<< FILL IN >>))
# SHOULD PRINT 5
# (wordCounts.values().sum()) does the trick, but I want to do this with map() and reduce()

I need to use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words.

But first I need to map() the RDD wordCounts, which consists of (key, value) pairs, into an RDD of just the values.

This is where I'm stuck. I've tried things like the following, but none of them work:

.map(lambda x: x.values())
.reduce(lambda x: sum(x))

and:

.map(lambda d: d[k] for k in d)
.reduce(lambda x: sum(x))

Any help would be greatly appreciated!

I finally got the answer, and it goes like this:

wordCounts
.map(lambda x:x[1])
.reduce(lambda x,y:x + y)

Yes, the lambda in .map takes the tuple x as its argument and returns its second element via x[1] (index 1 of the tuple). You can also take the pair as an unpacked tuple argument and return the second element, like this:

.map(lambda (x,y) : y)
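Note that unpacking a tuple directly in a lambda's argument list, as above, is Python 2 only syntax; it was removed in Python 3. If the code ever needs to run under Python 3, index into the pair or use the built-in values() transformation instead:

wordCounts.map(lambda kv: kv[1])   # index into the pair; works in Python 2 and 3
wordCounts.values()                # equivalent RDD of just the values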

Mr. Tompsett, I also did this:

from operator import add
x = (w
     .map(lambda x: x[1])
     .reduce(add))
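The original question also asked to divide the sum by the number of unique words. A minimal sketch of that final step, assuming wordCounts is the (word, count) RDD from above, where count() gives the number of unique words because reduceByKey leaves one pair per key:

from operator import add

total = wordCounts.map(lambda x: x[1]).reduce(add)  # 5
numUnique = wordCounts.count()                      # 3 distinct words
average = total / float(numUnique)                  # 5 / 3.0
print average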

Alternatively, instead of a map plus a reduce, you can also use aggregate, which should be faster:

In [7]: x = sc.parallelize([('rat', 2), ('elephant', 1), ('cat', 2)])
In [8]: x.aggregate(0, lambda acc, value: acc + value[1], lambda acc1, acc2: acc1 + acc2)
Out[8]: 5
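aggregate takes a zero value, a seqOp that folds each element into a partition-local accumulator, and a combOp that merges the partial results across partitions. The same computation with the two functions named, as an illustrative sketch (function names are my own):

def seq_op(acc, kv):
    # runs within each partition: add this pair's count to the accumulator
    return acc + kv[1]

def comb_op(acc1, acc2):
    # runs across partitions: merge the partial sums
    return acc1 + acc2

total = x.aggregate(0, seq_op, comb_op)  # 5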
