Scala: find the top-k elements of a keyed sequence

Given a sequence of things in which the first element of each tuple is the key:

val things = Seq(("key_1", ("first", 1)),("key_1", ("first_second", 11)), ("key_2", ("second", 2)))

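I want to count how often each key occurs and then keep only the elements that belong to the k most frequent keys.

In pandas or a database, I would:

1. Count the occurrences of each key
2. Join the counts back to the original data and filter

In Scala, the first part can be handled by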

things.groupBy(identity).mapValues(_.size)

The first step here would be:

things.groupBy(_._1).mapValues(_.map( _._2 ))

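But I'm not sure what the second step should be. In the example above, when looking at the top 1 key, key_1 occurs twice and is therefore selected. The desired output is the second element of each tuple belonging to the top-k keys: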

Seq(("first", 1),("first_second", 11))

Edit

I need a solution that works with Scala 2.11.x.

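This approach first groups by key to obtain a map from each key to its original items.

You could also use an ordered map or a PriorityQueue for a more efficient top-N computation, but if there are not many elements, a simple sortBy is fine, as shown below.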
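For comparison, here is a minimal sketch of the PriorityQueue variant mentioned above. It keeps a bounded min-heap of at most k groups, so picking the top k out of m distinct keys costs roughly O(m log k) instead of a full O(m log m) sort. The object name `TopKWithHeap` and the method `topKValues` are hypothetical and are not part of the original answer; ties between equally frequent keys are broken arbitrarily.

```scala
import scala.collection.mutable

object TopKWithHeap {
  // Hypothetical helper: values belonging to the k most frequent keys.
  def topKValues[K, V](things: Seq[(K, V)], k: Int): Seq[V] = {
    val grouped: Map[K, Seq[(K, V)]] = things.groupBy(_._1)
    // Reversed ordering on the count turns the max-heap into a min-heap,
    // so dequeue() always removes the group with the smallest count.
    val byCount = Ordering.by[(Int, Seq[V]), Int](_._1).reverse
    val heap = mutable.PriorityQueue.empty[(Int, Seq[V])](byCount)
    grouped.values.foreach { group =>
      heap.enqueue((group.size, group.map(_._2)))
      if (heap.size > k) heap.dequeue() // evict the least frequent group kept so far
    }
    // Groups leave the heap in ascending count order; prepending each one
    // leaves the most frequent group at the head of the list.
    var ordered = List.empty[(Int, Seq[V])]
    while (heap.nonEmpty) ordered = heap.dequeue() :: ordered
    ordered.flatMap(_._2)
  }
}
```

With the `things` sequence from the question, `TopKWithHeap.topKValues(things, 1)` selects the two values stored under key_1.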

def valuesOfNMostFrequentKeys(things: Seq[(String, (String, Int))], N: Int = 1) = {
  val grouped: Map[String, Seq[(String, (String, Int))]] = things.groupBy(_._1)
  // Pair each group with its size. Convert to an Array first: mapping a Map to
  // (count, group) tuples builds another Map and drops groups with equal counts.
  val countToTuples: Array[(Int, Seq[(String, (String, Int))])] =
    grouped.toArray.map((kv: (String, Seq[(String, (String, Int))])) => (kv._2.size, kv._2))
  // sort by count (first item in tuple) descending and take top N
  val sortByCount: Array[(Int, Seq[(String, (String, Int))])] = countToTuples.sortBy(-_._1)
  val topN: Array[(Int, Seq[(String, (String, Int))])] = sortByCount.take(N)
  // extract inner (String, Int) item from list of keys and values, and flatten
  topN.flatMap((kvList: (Int, Seq[(String, (String, Int))])) => kvList._2.map(_._2))
}
valuesOfNMostFrequentKeys(things)

Output:

valuesOfNMostFrequentKeys: (things: Seq[(String, (String, Int))], N: Int)Array[(String, Int)]
res44: Array[(String, Int)] = Array((first,1), (first_second,11))

Note that the result above is an Array, so you may want to call toSeq on it. Either way, this works in Scala 2.11.

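It looks like the following computes the desired output. The argument to take is k, the number of keys to keep (1 for the example in the question):

things.groupBy(_._1)
  .mapValues(e => (e.map(_._2).size, e.map(_._2))).toSeq.map(_._2)
  .sortBy(_._1).reverse.take(1).flatMap(_._2)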
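The "count, join back, filter" idea from the question can also be written out directly. The sketch below is one possible way to do it, not the method of either answer: `TopKByCount` and `topKValues` are hypothetical names, and ties between equally frequent keys are broken arbitrarily. Unlike the grouping approaches, it returns the values in their original order in the input sequence.

```scala
object TopKByCount {
  // Hypothetical helper mirroring "count, join back, filter".
  def topKValues[K, V](things: Seq[(K, V)], k: Int): Seq[V] = {
    // Step 1: count how often each key occurs.
    val counts: Map[K, Int] =
      things.groupBy(_._1).map { case (key, group) => key -> group.size }
    // Step 2: pick the k keys with the highest counts.
    val topKeys: Set[K] =
      counts.toSeq.sortBy { case (_, n) => -n }.take(k).map(_._1).toSet
    // Step 3: filter the original sequence, keeping only values of those keys.
    things.collect { case (key, value) if topKeys(key) => value }
  }
}
```

Building the set of top keys first plays the role of the join: each element is checked against that set in a single pass, and it only uses collection methods that are available in Scala 2.11.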
