I have a Seq[List[String]]. Example:
Vector(
["B","D","A","P","F"],
["B","A","F"],
["B","D","A","F"],
["B","D","A","T","F"],
["B","A","P","F"],
["B","D","A","P","F"],
["B","A","F"],
["B","A","F"],
["B","A","F"],
["B","A","F"]
)
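I want to get the counts of distinct combinations (such as "A","B") in a Map[String,Int], where the key (String) is the combination of elements and the value (Int) is the number of lists that contain this combination.
If "A", "B" and "F" all appear in all 10 records, then instead of "A",10 and "B",10 and "F",10 I would like them merged into "A","B","F",10.
Sample result for the above Seq[List[String]] (not all combinations are included):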
Map(
""A","B","F"" -> 10,
""A","B","D"" -> 4,
""A","B","P"" -> 2,
...
...
..
)
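Any Scala code or solution that produces this output would be much appreciated.
Assuming that lists with the same elements in a different order count as one group (for example, BAF and ABF fall into the same group), the solution is: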
//define the data
val a = Seq(
List("B","D","A","P","F"),
List("B","A","F"),
List("B","D","A","F"),
List("B","D","A","T","F"),
List("B","A","P","F"),
List("B","D","A","P","F"),
List("B","A","F"),
List("B","A","F"),
List("B","A","F"),
List("A","B","F")
)
//sort each list so that B,A,F is counted as the same group as A,B,F,
//and likewise for any other lists that differ only in element order
val b = a.map(_.sorted)
//group equal lists together (groupBy identity), then take the size of each group
b.groupBy(identity).collect{case (x, y) => (x, y.length)}
The output will look like this:
res1: scala.collection.immutable.Map[List[String],Int] = HashMap(List(A, B, F, P) -> 1, List(A, B, D, F) -> 1, List(A, B, D, F, T) -> 1, List(A, B, D, F, P) -> 2, List(A, B, F) -> 5)
To learn more about how groupBy(identity) works in Scala, you can visit this post.
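As a quick illustration of the groupBy(identity) idiom used above, here is a minimal sketch on a made-up list (the name `letters` and its contents are assumptions, not part of the question). It also shows `groupMapReduce`, available from Scala 2.13, which should give the same counts in one pass.

```scala
// Hypothetical sample data, not taken from the question.
val letters = List("a", "b", "a", "c", "a", "b")

// groupBy(identity) uses each element as its own key,
// so equal elements end up in the same bucket.
val grouped: Map[String, List[String]] = letters.groupBy(identity)

// Replacing each bucket by its size yields the occurrence counts.
val counts: Map[String, Int] = grouped.map { case (k, v) => (k, v.length) }

// Scala 2.13+: groupMapReduce maps every element to 1 and sums per key.
val countsOnePass: Map[String, Int] = letters.groupMapReduce(identity)(_ => 1)(_ + _)
```

Both `counts` and `countsOnePass` map "a" to 3, "b" to 2 and "c" to 1.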
Your vector is not valid Scala syntax; I think you meant something like this:
val items = Seq(
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "D", "A", "F"),
Seq("B", "D", "A", "T", "F"),
Seq("B", "A", "P", "F"),
Seq("B", "D", "A", "P", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F"),
Seq("B", "A", "F")
)
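It sounds like what you are trying to accomplish is two group by clauses. First, you want to take all combinations from each list and count how often each combination occurs across the collection; then, for combinations that occur with the same frequency, you perform another group by and merge them together.
To do this, you will need the following pipeline, which performs a double reduction after the double grouping.
Steps: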
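- Collect all combinations across the lists. Within each item, we compute every combination of the elements in that item's list, producing a Seq[Seq[String]] in which each Seq[String] is a unique combination. This is flattened because the (1 to group.length) operation produces a Seq of Seq[Seq[String]]. We then flatten the results from all lists in the vector together, which yields a single Seq[Seq[String]].
- The groupMapReduce function is used to count how often each combination occurs: every combination is assigned the value 1, and the values are summed. This gives the frequency of any particular combination.
- These groups are grouped again, this time by number of occurrences. So if "A" and "B" both occur 10 times, they are grouped together.
- The final map reduces the accumulated groups.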
val combos = items
  .flatMap(group => (1 to group.length).flatMap(i => group.combinations(i).map(_.sorted)).distinct) // Seq[Seq[String]]
  .groupMapReduce(identity)(_ => 1)(_ + _) // Map[Seq[String], Int]
  .groupMapReduce(_._2)(v => Seq(v))(_ ++ _) // Map[Int, Seq[(Seq[String], Int)]]
  .map { case (total, groups) => (groupReduction(groups), total) } // reduction function that decides how these groups are merged
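I defined this double-reduction function as follows. It turns a group such as Seq("A","B") into ""A","B"", and if Seq("A","B") has the same count as another group Seq("C"), the groups are joined together as ""A","B","C"".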
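def groupReduction(groups: Seq[(Seq[String], Int)]): String = {
  // wrap each element in double quotes (the plain s"""$v""" would emit it unquoted),
  // join each combination with commas, then join all combinations that share a count
  groups.map(_._1.map(v => "\"" + v + "\"").sorted.mkString(",")).sorted.mkString(",")
}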
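To make the merged keys easier to inspect, here is a minimal end-to-end sketch on a tiny input that can be checked by hand. The names `sample`, `joinQuoted` and `merged` are hypothetical; `joinQuoted` is assumed to have the same shape as the reduction function described above, with each element quoted explicitly.

```scala
// Hypothetical three-list input, small enough to verify manually.
val sample = Seq(Seq("B", "A"), Seq("A", "B", "C"), Seq("A"))

// Quote each element, join each combination with commas,
// then join all combinations that share the same count.
def joinQuoted(groups: Seq[(Seq[String], Int)]): String =
  groups.map(_._1.map(v => "\"" + v + "\"").sorted.mkString(",")).sorted.mkString(",")

val merged: Map[String, Int] = sample
  // every sorted combination of every size, once per list
  .flatMap(g => (1 to g.length).flatMap(i => g.combinations(i).map(_.sorted)).distinct)
  .groupMapReduce(identity)(_ => 1)(_ + _)   // Map[Seq[String], Int]
  .groupMapReduce(_._2)(v => Seq(v))(_ ++ _) // Map[Int, Seq[(Seq[String], Int)]]
  .map { case (total, groups) => (joinQuoted(groups), total) }
```

Here "A" occurs 3 times, while "B" and "A","B" both occur twice, so the count 2 ends up under the key "A","B","B". The boundaries between merged combinations are lost in that string; if they matter, a different separator between combinations may be preferable.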
The (1 to group.length) clause acts as a filter that can be adjusted to the group sizes of interest. If it is restricted to 3 to 3, the groups will be:
List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
List(List(A, B, F)): 10
List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1
As you can see, in your example `List(B, D, F)` and `List(A, D, F)` have the same count (4) as your second line, "A,B,D", so they are grouped together with it.
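The size restriction can be sketched as a small helper that takes the range bounds as parameters. The names `comboCounts`, `minSize`, `maxSize` and `records` are hypothetical; the helper only counts combinations and does not merge groups that share a count.

```scala
// Hypothetical helper: count combinations whose size lies in [minSize, maxSize].
def comboCounts(items: Seq[Seq[String]], minSize: Int, maxSize: Int): Map[Seq[String], Int] =
  items
    .flatMap { group =>
      // dedupe and sort first, so each list contributes a combination at most once
      // and every combination comes out in sorted order
      val elems = group.distinct.sorted
      (minSize to math.min(maxSize, elems.length)).flatMap(i => elems.combinations(i))
    }
    .groupMapReduce(identity)(_ => 1)(_ + _)

// The ten lists from the question.
val records = Seq(
  Seq("B", "D", "A", "P", "F"), Seq("B", "A", "F"), Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"), Seq("B", "A", "P", "F"), Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"), Seq("B", "A", "F"), Seq("B", "A", "F"), Seq("B", "A", "F")
)

// Only combinations of exactly three elements, as in the "3 to 3" case.
val triples: Map[Seq[String], Int] = comboCounts(records, 3, 3)
```

With this input, `triples(Seq("A", "B", "F"))` is 10, `triples(Seq("A", "B", "D"))` is 4 and `triples(Seq("A", "B", "P"))` is 3, in line with the groups listed above.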
Here it is:
scala> def count(seq: Seq[Seq[String]]): Map[Seq[String], Int] =
| seq.flatMap(_.toSet.subsets.filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)
| .toSeq.sortBy(-_._1.size).foldLeft(Map.empty[Set[String], Int]){ case (r, (p, i)) =>
| if(r.exists{ (q, j) => i == j && p.subsetOf(q)}) r else r.updated(p, i)
| }.map{ case(k, v) => (k.toSeq, v) }
def count(seq: Seq[Seq[String]]): Map[Seq[String], Int]
scala> count(Seq(
| Seq("B", "D", "A", "P", "F"),
| Seq("B", "A", "F"),
| Seq("B", "D", "A", "F"),
| Seq("B", "D", "A", "T", "F"),
| Seq("B", "A", "P", "F"),
| Seq("B", "D", "A", "P", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F"),
| Seq("B", "A", "F")
| ))
val res1: Map[Seq[String], Int] =
HashMap(List(F, A, B) -> 10,
List(F, A, B, P, D) -> 2,
List(T, F, A, B, D) -> 1,
List(F, A, B, D) -> 4,
List(F, A, B, P) -> 3)
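As you can see, "A,B,D" and "A,B,P" are dropped from the result, because each is a subset of a larger combination with the same count ("ABDF" with 4 and "ABFP" with 3)...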
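The REPL session above passes a two-parameter lambda `(q, j) => ...` to `exists` on a Map, which relies on Scala 3 parameter untupling. Below is a sketch of the same subset-pruning idea written with `case` patterns so that it should also compile on Scala 2.13, and returning the Map[String, Int] shape asked for in the question. The name `maximalCombos` and the quoted key format are assumptions.

```scala
def maximalCombos(seq: Seq[Seq[String]]): Map[String, Int] = {
  // Count every non-empty subset once per list that contains it.
  val counts: Map[Set[String], Int] =
    seq.flatMap(_.toSet.subsets().filter(_.nonEmpty)).groupMapReduce(identity)(_ => 1)(_ + _)

  // Visit larger subsets first; skip a subset when an already kept
  // superset has exactly the same count.
  val kept = counts.toSeq
    .sortBy { case (s, _) => -s.size }
    .foldLeft(Map.empty[Set[String], Int]) { case (acc, (s, n)) =>
      if (acc.exists { case (t, m) => m == n && s.subsetOf(t) }) acc
      else acc.updated(s, n)
    }

  // Render each key as a sorted, quoted, comma-separated string such as "A","B","F".
  kept.map { case (s, n) => (s.toSeq.sorted.map(e => "\"" + e + "\"").mkString(","), n) }
}
```

Calling `maximalCombos` on the ten lists from the question should yield five entries, with "A","B","F" mapped to 10. Note that enumerating all subsets grows exponentially with the length of each list, so this approach is only practical for short lists.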