Aggregating over multiple columns in a Spark DataFrame (all combinations)

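I want to get customer counts for every combination of the columns I have in a DataFrame.

For example, suppose I have a DataFrame with 5 columns:

id, col1, col2, col3, cust_id

I need the customer count for all combinations: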

    id, col1, count(cust_id)
    id, col1, col2, count(cust_id)
    id, col1, col3, count(cust_id)
    id, col1, col2, col3, count(cust_id)
    id, col2, count(cust_id)
    id, col2, col3, count(cust_id)

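and so on, for all permutations and combinations.

It is difficult to pass every combination to the DataFrame's groupBy function by hand and then aggregate the customer count for each one.

Is there a way to achieve this and then combine all the results into a single DataFrame, so that the output can be written to one file?

This looks complicated to me, and I would really appreciate it if someone could suggest a solution. Please let me know if any more details are needed.

Thanks a lot.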

This is possible; the operation is called cube:

    import org.apache.spark.sql.functions.count

    df.cube("id", "col1", "col2", "col3").agg(count("cust_id"))
      .na.drop(minNonNulls = 3)  // To exclude some combinations: keeps rows with at least 3 non-null values

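The SQL version also provides GROUPING SETS, which can be more efficient than .na.drop because only the requested combinations are computed.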
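As a rough sketch of the GROUPING SETS route, assuming the combinations wanted are "id plus one or more of col1, col2, col3" as in the question: the helper name groupingSetsClause, the view name customers, the alias cust_count and the sample rows are all made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object GroupingSetsSketch {
  // Hypothetical helper: builds "(id, col1), (id, col2), ..." from the
  // always-present columns plus every non-empty subset of the optional ones.
  def groupingSetsClause(fixed: Seq[String], optional: Seq[String]): String =
    (1 to optional.length)
      .flatMap(n => optional.combinations(n))
      .map(cols => (fixed ++ cols).mkString("(", ", ", ")"))
      .mkString(", ")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sets-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, only to make the sketch self-contained.
    val df = Seq(
      (1, "a", "x", "p", 100),
      (1, "a", "y", "p", 101),
      (2, "b", "x", "q", 102)
    ).toDF("id", "col1", "col2", "col3", "cust_id")
    df.createOrReplaceTempView("customers")

    val sets = groupingSetsClause(Seq("id"), Seq("col1", "col2", "col3"))

    // Only the listed combinations are aggregated. A null in col1..col3
    // means that column was not part of the grouping set for that row
    // (assuming the data itself has no nulls in those columns).
    val result = spark.sql(
      s"""SELECT id, col1, col2, col3, COUNT(cust_id) AS cust_count
         |FROM customers
         |GROUP BY id, col1, col2, col3
         |GROUPING SETS ($sets)""".stripMargin)

    result.show()
    spark.stop()
  }
}
```

All combinations end up in one DataFrame, so if a single output file is needed, something like result.coalesce(1).write.option("header", "true").csv(outputPath) should work, where outputPath is a placeholder for your own location.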
