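So I have (firstname, list of surnames) key-value pairs. From these I can easily create (firstname, surname) pairs as well as (surname, list of firstnames) pairs. However, what I need are (firstname, firstname) pairs for people who share a surname.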
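My current solution is to pass a function to flatMap that takes (surname, list of firstnames) as input and returns (firstname, firstname) pairs by iterating over the list of firstnames for each surname. However, I have noticed that Spark does not parallelize my program properly. I would like to know whether the same result can be achieved using only the map and join functions, that is, without writing a special function for flatMap?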
In other words, here is an example of my input:
(FirstName1, [Surname1, Surname2, Surname3]),
(FirstName2, [Surname2, Surname4]),
(FirstName3, [Surname5, Surname6]),
(FirstName4, [Surname6, Surname7]),
(FirstName5, [Surname1, Surname4])
For this input, we should get the following output:
(FirstName1, FirstName2),
(FirstName1, FirstName5),
(FirstName2, FirstName1),
(FirstName2, FirstName5),
(FirstName3, FirstName4),
(FirstName4, FirstName3),
(FirstName5, FirstName1),
(FirstName5, FirstName2)
How about a flatMap to (surname, firstname) pairs, followed by a groupByKey?
val bySurname = myRdd.flatMap{case (firstname, surnames) => surnames map {s => (s, firstname)}}
bySurname.groupByKey.map(_._2)
// then flatMap each group of firstnames (one Iterable per surname) into the pairs it contains
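To make the last step concrete, here is a minimal sketch of the whole pipeline. It deliberately uses plain Scala collections rather than an RDD so that it runs without a Spark dependency; the object name SharedSurnamePairs and the helpers bySurname and pairs are made up for illustration, and the sample data is the input from the question. Treat it as an outline of the idea, not a tuned Spark job.

```scala
object SharedSurnamePairs {
  // Sample input copied from the question.
  val input: Seq[(String, Seq[String])] = Seq(
    "FirstName1" -> Seq("Surname1", "Surname2", "Surname3"),
    "FirstName2" -> Seq("Surname2", "Surname4"),
    "FirstName3" -> Seq("Surname5", "Surname6"),
    "FirstName4" -> Seq("Surname6", "Surname7"),
    "FirstName5" -> Seq("Surname1", "Surname4")
  )

  // Step 1: explode each record into (surname, firstname) pairs.
  // This mirrors the flatMap in the snippet above.
  def bySurname(data: Seq[(String, Seq[String])]): Seq[(String, String)] =
    data.flatMap { case (firstname, surnames) => surnames.map(s => (s, firstname)) }

  // Step 2: group by surname, then emit every ordered pair of distinct
  // firstnames within a group. Collecting into a Set drops duplicates
  // that arise when two people share more than one surname.
  def pairs(data: Seq[(String, Seq[String])]): Set[(String, String)] =
    bySurname(data)
      .groupBy(_._1)
      .values
      .flatMap { group =>
        val names = group.map(_._2)
        for (a <- names; b <- names if a != b) yield (a, b)
      }
      .toSet

  def main(args: Array[String]): Unit =
    pairs(input).toSeq.sorted.foreach(println)
}
```

On the sample input this produces eight pairs, including (FirstName1, FirstName5), since those two share Surname1. If you would rather avoid groupByKey and a hand-written pairing function on the RDD, one option (not run here) is a self-join on the keyed RDD: bySurname.join(bySurname).values.filter { case (a, b) => a != b }.distinct(). That uses only the built-in join, values, filter and distinct operations, at the cost of shuffling every pair.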