Scala: process a list of string lists and produce a Map["combination", number of lists containing that combination]



I have a Seq[List[String]]. For example:

Vector(
["B","D","A","P","F"], 
["B","A","F"], 
["B","D","A","F"], 
["B","D","A","T","F"], 
["B","A","P","F"], 
["B","D","A","P","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"], 
["B","A","F"]
)

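I want to get the counts of the different combinations (such as "A","B") in a Map[String,Int], where the key (String) is the combination of elements and the value (Int) is the number of lists that contain this combination.

If "A", "B" and "F" appear in all 10 records, then instead of "A", 10 and "B", 10 and "F", 10 I would like them consolidated into "A","B","F", 10.

Sample result (not all combinations included) for the above Seq[List[String]]: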

Map(
""A","B","F"" -> 10,
""A","B","D"" -> 4,
""A","B","P"" -> 2,
...
...
..
)

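I would appreciate any Scala code or solution that produces this output.

Assuming that lists differing only in order count as the same group (for example, BAF and ABF fall into one group), the solution is: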

//define the data
val a = Seq(
List("B","D","A","P","F"),
List("B","A","F"),
List("B","D","A","F"),
List("B","D","A","T","F"),
List("B","A","P","F"),
List("B","D","A","P","F"),
List("B","A","F"),
List("B","A","F"),
List("B","A","F"),
List("A","B","F")
)
// Sort each list so that B,A,F is counted as the same combination as A,B,F
// (and likewise for any other lists that differ only in order).
val b = a.map(_.sorted)
// Group identical lists together, then count how many lists are in each group.
b.groupBy(identity).map { case (combo, lists) => (combo, lists.length) }

The output will look like this:

res1: scala.collection.immutable.Map[List[String],Int] = HashMap(List(A, B, F, P) -> 1, List(A, B, D, F) -> 1, List(A, B, D, F, T) -> 1, List(A, B, D, F, P) -> 2, List(A, B, F) -> 5)

To learn more about how groupBy(identity) works in Scala, you can visit this article.
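
As a minimal sketch of that idea, the following groups a small made-up sequence by identity and then counts each bucket. The `letters` sample and the names `grouped` and `counts` are illustrative only and do not come from the question.

```scala
// Made-up sample data, not taken from the question.
val letters = Seq("a", "b", "a", "c", "a", "b")

// groupBy(identity) uses each element as its own key,
// so equal elements end up in the same bucket.
val grouped: Map[String, Seq[String]] = letters.groupBy(identity)

// Replacing each bucket with its size gives the number of occurrences.
val counts: Map[String, Int] =
  grouped.map { case (key, bucket) => key -> bucket.size }
```

The same pattern applies when the elements are sorted lists instead of single strings, which is what the answer above relies on.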

Your vector is not written in valid Scala syntax; I think you meant something like this:

val items = Seq(
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "D", "A", "F"),
Seq("B", "D", "A", "T", "F"),
Seq("B", "A", "P", "F"),
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F")
)

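It sounds like what you are trying to accomplish is two group-by clauses. First, you want to get all combinations from each list, count how often each combination occurs across the collection, and then, for combinations that occur with the same frequency, perform another group by and merge them together.

To do this, you will need the following functions, which perform a double reduction after the double grouping.

Steps:

  1. Collect all combinations. For each item in items, we compute every combination of the elements in that item's list, which yields a Seq[Seq[String]] in which each Seq[String] is a unique combination. This is flattened because the (1 to group.length) operation produces a Seq of Seq[Seq[String]]. The results from all the lists in the vector are then flattened together, so you end up with a single Seq[Seq[String]].
  2. The groupMapReduce function counts how often each combination occurs: every occurrence is mapped to the value 1 and the ones are summed. This gives the frequency of each particular combination.
  3. The results are grouped again, this time by number of occurrences. So if "A" and "B" both occur 10 times, they are grouped together.
  4. The final map reduces each accumulated group to a single key.
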
val combos = items
  .flatMap(group => (1 to group.length).flatMap(i => group.combinations(i).map(_.sorted)).distinct) // Seq[Seq[String]]
  .groupMapReduce(identity)(_ => 1)(_ + _)  // Map[Seq[String], Int]
  .groupMapReduce(_._2)(v => Seq(v))(_ ++ _) // Map[Int, Seq[(Seq[String], Int)]]
  .map { case (total, groups) => (groupReduction(groups), total) } // groupReduction decides how the groups sharing a count are merged into one key

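I defined this double-reduction function as follows. It converts a group such as Seq("A","B") into ""A","B"", and if Seq("A","B") has the same count as another group Seq("C"), the groups are joined together into ""A","B","C"".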

def groupReduction(groups: Seq[(Seq[String], Int)]): String = {
  // Wrap each element in double quotes, join each group with commas, then join the groups.
  groups.map(_._1.map(v => "\"" + v + "\"").sorted.mkString(",")).sorted.mkString(",")
}
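
To see what each of the four steps produces, the pipeline can be split into named intermediate values. The following is only a sketch on a made-up two-list input: the names `sample`, `allCombos`, `frequency`, `byCount`, `render` and `merged` are hypothetical, and it assumes Scala 2.13 or later, where `groupMapReduce` and `groupMap` are available.

```scala
// Made-up input, smaller than the one in the question.
val sample = Seq(Seq("B", "A"), Seq("A", "B", "C"))

// Step 1: every non-empty combination of every list,
// sorted so that element order does not matter.
val allCombos: Seq[Seq[String]] =
  sample.flatMap(group =>
    (1 to group.length).flatMap(i => group.combinations(i).map(_.sorted)))

// Step 2: how many lists contain each combination.
val frequency: Map[Seq[String], Int] =
  allCombos.groupMapReduce(identity)(_ => 1)(_ + _)

// Step 3: bucket the combinations by their count.
val byCount: Map[Int, Seq[Seq[String]]] =
  frequency.toSeq.groupMap(_._2)(_._1)

// Step 4: render each bucket as one key, quoting every element
// (illustrative formatting, assumed from the description above).
def render(groups: Seq[Seq[String]]): String =
  groups.map(_.map(v => "\"" + v + "\"").mkString(",")).sorted.mkString(",")

val merged: Map[String, Int] =
  byCount.map { case (total, groups) => render(groups) -> total }
```

Note that when all combination sizes are included, a bucket contains single elements as well as the larger combinations built from them, so the merged key for count 2 here repeats elements: `"A","A","B","B"`.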

This filter can be adjusted to specific groups of interest through the (1 to group.length) clause. If it is restricted to 3 to 3, the groups are:

List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
List(List(A, B, F)): 10
List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1

As you can see in your example, `List(B, D, F)` and `List(A, D, F)` are also associated with your second line "A,B,D".
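
A rough sketch of the size-3 restriction, assuming Scala 2.13 or later: it repeats the `items` value so that it is self-contained, and calls `combinations(3)` directly instead of ranging over `3 to 3`. The names `triples` and `triplesByCount` are made up for illustration.

```scala
val items = Seq(
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"),
  Seq("B", "A", "P", "F"),
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F")
)

// Count, for every 3-element combination, how many lists contain it.
val triples: Map[Seq[String], Int] =
  items
    .flatMap(_.combinations(3).map(_.sorted))
    .groupMapReduce(identity)(_ => 1)(_ + _)

// Bucket the combinations by count, as in the listing above.
val triplesByCount: Map[Int, Set[Seq[String]]] =
  triples.toSeq.groupMap(_._2)(_._1).view.mapValues(_.toSet).toMap
```

Keeping the buckets as sets of combinations, rather than one concatenated string, makes it easier to see which combinations share a count.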

Here it is:

scala> def count(seq: Seq[Seq[String]]): Map[Seq[String], Int] =
|   seq.flatMap(_.toSet.subsets.filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)
|      .toSeq.sortBy(-_._1.size).foldLeft(Map.empty[Set[String], Int]){ case (r, (p, i)) =>
|        if(r.exists{ (q, j) => i == j && p.subsetOf(q)}) r else r.updated(p, i)
|      }.map{ case(k, v) => (k.toSeq, v) }
def count(seq: Seq[Seq[String]]): Map[Seq[String], Int]
scala> count(Seq(
|   Seq("B", "D", "A", "P", "F"),
|   Seq("B", "A", "F"),
|   Seq("B", "D", "A", "F"),
|   Seq("B", "D", "A", "T", "F"),
|   Seq("B", "A", "P", "F"),
|   Seq("B", "D", "A", "P", "F"),
|   Seq("B", "A", "F"),
|   Seq("B", "A", "F"),
|   Seq("B", "A", "F"),
|   Seq("B", "A", "F")
| ))
val res1: Map[Seq[String], Int] = 
HashMap(List(F, A, B) -> 10, 
List(F, A, B, P, D) -> 2, 
List(T, F, A, B, D) -> 1, 
List(F, A, B, D) -> 4, 
List(F, A, B, P) -> 3)

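As you can see, "A,B,D" and "A,B,P" from your expected result are dropped, because they are subsets of "ABDF" and "ABFP", which have the same counts...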
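
The `(q, j) =>` lambda inside `exists` relies on Scala 3 parameter untupling. As a hedged alternative that should also compile on Scala 2.13, the same idea can be split into two steps using `case` patterns. This sketch replaces the sorted fold with a plain filter over all strict supersets, which gives the same result because absorption by an equal-count superset is transitive. The names `subsetCounts`, `dropAbsorbed` and `small` are hypothetical.

```scala
// Count, for every non-empty subset of every list, how many lists contain it.
def subsetCounts(lists: Seq[Seq[String]]): Map[Set[String], Int] =
  lists
    .flatMap(_.toSet.subsets().filter(_.nonEmpty))
    .groupMapReduce(identity)(_ => 1)(_ + _)

// Keep a combination only if no strictly larger combination has the same count.
def dropAbsorbed(counts: Map[Set[String], Int]): Map[Set[String], Int] =
  counts.filter { case (p, i) =>
    !counts.exists { case (q, j) => i == j && p != q && p.subsetOf(q) }
  }

// Made-up input: {A}, {B} and {A,B} all occur twice, so only {A,B} survives.
val small = Seq(Seq("A", "B"), Seq("A", "B", "C"))
val reduced: Map[Set[String], Int] = dropAbsorbed(subsetCounts(small))
```

Be aware that enumerating all subsets grows exponentially with list length, so this is only practical for short lists like the ones in the question.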
