Scala RDD: match similar wording



So I have a list of verbs.

Suppose:

verbs.txt

have, have, having, had
give, give, gave, given
take, take, took, taken

Split them into RDDs:

val verbs = sc.textFile("verbs.txt").map(x => x.split(", ")).collect()

Therefore:

verbs: Array[Array[String]] = Array(Array(have, have, having, had), Array(give, give, gave, given), Array(take, take, took, taken))

Suppose:

val wordcount = sc.textFile("data.txt")

data.txt

have have have having having had had had had had give give give give give give give give give give gave gave given given given given take take took took took took took took taken taken

I counted the words, so wordcount =

(have, 3)
(having, 2)
(had, 5)
(give, 10)
(gave, 2)
(given, 4)
(take, 2)
(took, 6)
(taken, 2)

I want to be able to merge the counts for the same verb together. Example: (have, 3), (having, 2), (had, 5) => (have, 10)

using the first value of each array as the base form of the verb. How can I do this?

Since you tagged the question with RDD, I assume your word count data is an RDD.

// Read the text file
import org.apache.spark.rdd.RDD

val sc = spark.sparkContext
val textFile: RDD[String] = sc.textFile("data.txt")
// So you have this, as you said
val verbs = Array(Array("have", "have", "having", "had"), Array("give", "give", "gave", "given"), Array("take", "take", "took", "taken"))
val data = textFile
  .flatMap(_.split(" ")) // Split each line into words/tokens (this is called tokenization); I used a space as the separator, use a tab if that is your separator
  .map(t => (t, 1))      // Emit a count of 1 per token, e.g. (have, 1)
  .reduceByKey(_ + _)    // Count occurrences of each token, e.g. (have, 5)

val t = data.map(d => (verbs.find(v => v.contains(d._1)).map(_.head).getOrElse(d._1), d._2)) // Map each (verb, count) to (base verb, count), e.g. (having, 5) => (have, 5); unknown verbs are left as-is
  .reduceByKey(_ + _) // Sum all counts that share the same base verb: (have, 5), (have, 3) => (have, 8)
t.take(10).foreach(println)
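The `find`/`contains` lookup above is ordinary Scala, so it can be sketched and checked without a Spark cluster. This is a minimal plain-Scala sketch of the same logic (the `Seq` of counts here is hypothetical sample data, not the real RDD):

```scala
// Plain-Scala sketch of the base-form lookup used above (no Spark needed)
val verbs = Array(
  Array("have", "have", "having", "had"),
  Array("give", "give", "gave", "given"),
  Array("take", "take", "took", "taken"))

// Find the sub-array that contains the word; its head is the base form.
// Unknown words fall back to themselves, mirroring getOrElse above.
def baseForm(word: String): String =
  verbs.find(_.contains(word)).map(_.head).getOrElse(word)

val counts = Seq(("have", 3), ("having", 2), ("had", 5), ("unknown", 1))
val merged = counts
  .map { case (w, c) => (baseForm(w), c) }              // replace each verb by its base form
  .groupBy(_._1)                                        // group counts by base form
  .map { case (base, vs) => (base, vs.map(_._2).sum) }  // sum per base form
println(merged)
```

Here `(have, 3)`, `(having, 2)` and `(had, 5)` all collapse onto the base form `have`, giving `(have, 10)`, which is exactly what `reduceByKey(_ + _)` does on the RDD.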

Another option (without collecting the verbs):

// You don't have to collect the verbs; keep them as an RDD
val verbs2 = sc.parallelize(Array(Array("have", "have", "having", "had"), Array("give", "give", "gave", "given"), Array("take", "take", "took", "taken"))) // This is the state before collect
  .flatMap(v => v.map(v2 => (v2, v.head))) // Generate tuples of verb -> base verb, e.g. (had, have)
  .reduceByKey((k1, k2) => if (k1 == k2) k1 else k2) // The current verbs array generates (have, have) twice; this eliminates the duplicate records
val data2 = textFile
  .flatMap(_.split(" ")) // Split each line into words/tokens (tokenization); I used a space as the separator, use a tab if that is your separator
  .map(t => (t, 1))      // Emit a count of 1 per token, e.g. (have, 1)
  .reduceByKey(_ + _)    // Count occurrences of each token, e.g. (have, 5)
val t2 = verbs2.join(data2) // Join the two RDDs by key: verb -> (base verb, verb count)
  .map(d => d._2)           // Keep only what we need: key is the base verb, value is the count of that verb
  .reduceByKey(_ + _)       // Sum all counts that share the same base verb: (have, 5), (have, 3) => (have, 8)
t2.take(10).foreach(println)
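The shape of the `(verb -> base verb)` pairs and the effect of the join can likewise be sketched with plain Scala collections (a sketch under the assumption that the counts map stands in for `data2`; `distinct` plays the role of the deduplicating `reduceByKey`):

```scala
// Plain-Scala sketch of the pairs verbs2 holds and of the join (no Spark needed)
val verbs = Array(
  Array("have", "have", "having", "had"),
  Array("give", "give", "gave", "given"),
  Array("take", "take", "took", "taken"))

// flatMap produces (verb, base verb) tuples; (have, have) appears twice,
// so deduplicate, which is what the reduceByKey above achieves
val verbPairs = verbs.flatMap(v => v.map(v2 => (v2, v.head))).distinct

// Simulate the join: keep only verbs that appear in the counts,
// paired as (base verb, count)
val counts = Map("have" -> 3, "having" -> 2, "had" -> 5, "took" -> 6)
val joined = verbPairs.collect { case (v, base) if counts.contains(v) => (base, counts(v)) }
val summed = joined.groupBy(_._1).map { case (b, vs) => (b, vs.map(_._2).sum) }
println(summed)
```

Note that, unlike the first approach, a plain `join` silently drops words that have no entry in the verbs RDD; use that only if every word in the data is covered.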

Of course, this answer assumes you always have the verbs array and that its first element is the base form. If you want something that works without the verbs array and converts any verb to its base form, that is really an NLP (natural language processing) task; you would need to use some word-normalization technique, as EmiCareOfCell44 mentioned. You can also find implementations of such processes in the Spark ML library.

It is better to broadcast the verb forms and then look them up. That makes the lookup simple and efficient, since the values are available in each executor in a single step.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Demo")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
val verbs = Array(Array("have", "have", "having", "had"), Array("give", "give", "gave", "given"), Array("take", "take", "took", "taken"))
// Broadcast it as a map
val verbMap = verbs.flatMap(e => {
  e.map(i => i -> e(0))
}).toMap
val bdVerbMap = sc.broadcast(verbMap)
val data = sc.parallelize(List(("have", 3),
("having", 2),
("had", 5),
("give", 10),
("gave", 2),
("given", 4),
("take", 2),
("took", 6),
("taken", 2)))
// Look up the broadcast map to normalize every form of the verb, then reduce by key
val unifiedVerbCnt = data.map(t => (bdVerbMap.value.getOrElse(t._1, t._1), t._2))
  .reduceByKey((x, y) => x + y)
unifiedVerbCnt.collect.foreach(println)
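The broadcast map itself is built with ordinary Scala collection operations, so its contents and the lookup can be checked without Spark. A minimal sketch (the `data` list reuses the sample counts from the question):

```scala
// Plain-Scala sketch of the broadcast map's contents (no Spark needed);
// toMap keeps the last value per key, so the duplicate "have" entry is harmless
val verbs = Array(
  Array("have", "have", "having", "had"),
  Array("give", "give", "gave", "given"),
  Array("take", "take", "took", "taken"))

val verbMap = verbs.flatMap(e => e.map(i => i -> e(0))).toMap

val data = List(("have", 3), ("having", 2), ("had", 5), ("give", 10))
val unified = data
  .map { case (w, c) => (verbMap.getOrElse(w, w), c) } // normalize to the base form
  .groupBy(_._1)
  .map { case (b, vs) => (b, vs.map(_._2).sum) }       // sum per base form
println(unified)
```

Because the map is broadcast once and read locally on each executor, this avoids the shuffle that the `join`-based solution needs.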

You can do the following:

verbs
  .flatten              // combines all the arrays into a single array
  .map(p => (p, 1))     // Array((have, 1), (have, 1), (having, 1), ...)
  .groupBy(_._1)        // Map(have -> Array((have, 1), (have, 1)), given -> Array((given, 1)), ...)
  .map(l => (l._1, l._2.map(_._2).sum)) // Map(have -> 2, given -> 1, had -> 1, took -> 1, ...)
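Note that this snippet counts how often each form occurs inside the verbs array itself, not in data.txt (so `have`, which is listed twice, gets 2). A self-contained sketch with the verbs array defined, so the result can be inspected:

```scala
// Counting occurrences of each form within the verbs array itself
val verbs = Array(
  Array("have", "have", "having", "had"),
  Array("give", "give", "gave", "given"),
  Array("take", "take", "took", "taken"))

val counted = verbs
  .flatten            // one array with all the forms
  .map(p => (p, 1))
  .groupBy(_._1)
  .map { case (w, ps) => (w, ps.map(_._2).sum) }
println(counted)
```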

It is best to try this code in the Scala shell or an IDE to understand it.
