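What is the difference between the cube, rollup and groupBy operators?

I can't find any detailed documentation on the differences.

I did notice a difference, because swapping the cube and groupBy function calls gives me different results. With cube, I get a lot of null values in the columns I previously used with groupBy.

These are not intended to work in the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words,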

table.groupBy($"foo", $"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

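cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies the aggregate expressions to every possible combination of the grouping columns. Let's say you have data like this: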

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

df.show
// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

and you compute cube(x, y) with count as the aggregation:

df.cube($"x", $"y").count.show
// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

rollup is a function similar to cube; it computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

Just for comparison, let's look at the result of a plain groupBy:

df.groupBy($"x", $"y").count.show
// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

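To summarize:

With a plain GROUP BY, every row is included exactly once, in its corresponding summary.

With GROUP BY CUBE(..), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the output shown above is equivalent to something like this (assuming we could use NULL placeholders):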

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
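
The UNION ALL decomposition above can also be sketched in plain Python, without Spark. This is only an illustration of the semantics, not how Spark implements it; the helper name cube_count is made up for this example, and None stands in for the NULL placeholder (so it assumes the data itself contains no None values).

```python
from collections import Counter
from itertools import combinations

# Same (x, y) rows as the example DataFrame above.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def cube_count(rows, n_cols):
    """Count rows for every subset of column positions.

    Columns that are not part of the current grouping set are replaced
    by None, which plays the role of the NULL placeholder.
    """
    result = Counter()
    for size in range(n_cols + 1):
        for kept in combinations(range(n_cols), size):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n_cols))
                result[key] += 1
    return result

cube = cube_count(rows, 2)
for key, count in sorted(cube.items(), key=repr):
    print(key, count)
```

Every input row contributes to four keys here, one per grouping set, which is why the cube output has more rows than the groupBy output.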

GROUP BY ROLLUP(...) is similar to CUBE, but works hierarchically by filling columns from left to right:

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
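
The same kind of plain-Python sketch works for ROLLUP: instead of every subset of columns, only the leading prefixes of the column list are used. As before, rollup_count is a hypothetical helper written for illustration, and None stands in for NULL.

```python
from collections import Counter

# Same (x, y) rows as the example DataFrame above.
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

def rollup_count(rows, n_cols):
    """Count rows for each leading prefix of the columns: (), (x), (x, y)."""
    result = Counter()
    for prefix_len in range(n_cols + 1):
        for row in rows:
            # Keep the first prefix_len values, pad the rest with None.
            key = row[:prefix_len] + (None,) * (n_cols - prefix_len)
            result[key] += 1
    return result

rollup = rollup_count(rows, 2)
for key, count in sorted(rollup.items(), key=repr):
    print(key, count)
```

Note that keys such as (None, 2) never appear, because y is only ever grouped together with x.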

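ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.

There is one more member of this "family" that explains all of the above: GROUPING SETS. At the time of writing it is not available in the PySpark/Scala DataFrame API, but it exists in the SQL API.

GROUPING SETS lets you specify whatever combination of groupings you need. The others (cube, rollup, groupBy) return predefined combinations (of the values that exist in the data):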

cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y)
(all possible existing combinations).

rollup("id", "x", "y") will return only (), (id), (id, x), (id, x, y)
(the combinations that form a leading prefix of the provided sequence).

groupBy("id", "x", "y") will return only the (id, x, y) combination.
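
To make the three lists above concrete, here is a small plain-Python sketch that enumerates the grouping sets each operator corresponds to. The function names cube_sets, rollup_sets and group_by_sets are invented for this illustration and are not part of any Spark API.

```python
from itertools import combinations

def cube_sets(cols):
    """Every subset of the columns, in their original order."""
    return [c for size in range(len(cols) + 1) for c in combinations(cols, size)]

def rollup_sets(cols):
    """Every leading prefix of the columns, from () up to the full list."""
    return [tuple(cols[:n]) for n in range(len(cols) + 1)]

def group_by_sets(cols):
    """Only the full column list."""
    return [tuple(cols)]

cols = ("id", "x", "y")
print(cube_sets(cols))      # 2**3 = 8 grouping sets
print(rollup_sets(cols))    # 3 + 1 = 4 grouping sets
print(group_by_sets(cols))  # 1 grouping set
```

For n columns this gives 2**n sets for cube, n + 1 for rollup and 1 for groupBy, and every rollup set is also a cube set.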

Examples

Input df:

df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")

cube

df.cube("id", "x", "y").count()

is the same as:

spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY 
GROUPING SETS (
(),
(id),
(x),
(y),
(id, x),
(id, y),
(x, y),
(id, x, y)
)
""")

+----+----+----+-----+
|  id|   x|   y|count|
+----+----+----+-----+
|null|null|   2|    3|
|null|null|null|    4|
|   a|null|   2|    3|
|   a| foo|null|    2|
|   a| foo|   1|    1|
|   a|null|   1|    1|
|null| foo|null|    2|
|   a|null|null|    4|
|null|null|   1|    1|
|null| foo|   2|    1|
|null| foo|   1|    1|
|   a| foo|   2|    1|
|null| bar|null|    2|
|null| bar|   2|    2|
|   a| bar|null|    2|
|   a| bar|   2|    2|
+----+----+----+-----+

rollup

df.rollup("id", "x", "y").count()

is the same as GROUPING SETS ((), (id), (id, x), (id, x, y)):

spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY 
GROUPING SETS (
(),
(id),
--(x),      <- (not used)
--(y),      <- (not used)
(id, x),
--(id, y),  <- (not used)
--(x, y),   <- (not used)
(id, x, y)
)
""")

+----+----+----+-----+
|  id|   x|   y|count|
+----+----+----+-----+
|null|null|null|    4|
|   a| foo|null|    2|
|   a| foo|   1|    1|
|   a|null|null|    4|
|   a| foo|   2|    1|
|   a| bar|null|    2|
|   a| bar|   2|    2|
+----+----+----+-----+

groupBy

df.groupBy("id", "x", "y").count()

is the same as GROUPING SETS ((id, x, y)):

spark.sql("""
SELECT id, x, y, count(1) count
FROM df
GROUP BY 
GROUPING SETS (
--(),       <- (not used)
--(id),     <- (not used)
--(x),      <- (not used)
--(y),      <- (not used)
--(id, x),  <- (not used)
--(id, y),  <- (not used)
--(x, y),   <- (not used)
(id, x, y)
)
""")

+---+---+---+-----+
| id|  x|  y|count|
+---+---+---+-----+
|  a|foo|  2|    1|
|  a|foo|  1|    1|
|  a|bar|  2|    2|
+---+---+---+-----+

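Note. All of the above return only combinations that exist in the data. The example DataFrame has no row for "id":"a", "x":"bar", "y":1, so even cube does not return it. To get all possible combinations (existing or not), we can do something like the following (crossJoin):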

# The := operator requires Python 3.8+
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols := ["id", "x", "y"]):
    # Cross join the distinct values of each column to build every combination
    df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (
    df_cartesian.drop("_tmp")
    .join(df.cube(*cols).count(), cols, 'full')
)
df_final.show()
# +----+----+----+-----+
# |  id|   x|   y|count|
# +----+----+----+-----+
# |null|null|null|    4|
# |null|null|   1|    1|
# |null|null|   2|    3|
# |null| bar|null|    2|
# |null| bar|   2|    2|
# |null| foo|null|    2|
# |null| foo|   1|    1|
# |null| foo|   2|    1|
# |   a|null|null|    4|
# |   a|null|   1|    1|
# |   a|null|   2|    3|
# |   a| bar|null|    2|
# |   a| bar|   1| null|
# |   a| bar|   2|    2|
# |   a| foo|null|    2|
# |   a| foo|   1|    1|
# |   a| foo|   2|    1|
# +----+----+----+-----+
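
The idea behind the crossJoin step can be checked without Spark. The sketch below, in plain Python, builds the Cartesian product of the distinct values of each column and subtracts the rows that actually exist; it is only meant to show why ("a", "bar", 1) ends up with a null count above.

```python
from itertools import product

# Same (id, x, y) rows as the example DataFrame.
rows = [("a", "foo", 1), ("a", "foo", 2), ("a", "bar", 2), ("a", "bar", 2)]

# Distinct values per column, the equivalent of df.select(c).distinct().
distinct_per_column = [sorted(set(col)) for col in zip(*rows)]

# Every possible combination, the equivalent of the chained crossJoin.
all_combinations = set(product(*distinct_per_column))

# Combinations that never occur in the data.
missing = sorted(all_combinations - set(rows))
print(missing)  # [('a', 'bar', 1)]
```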

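1. If you do not want nulls, remove them first, for example:

val dfWithoutNull = df.na.drop("all", Seq("colName1", "colName2"))

The expression above drops the rows of the original DataFrame in which all of the listed columns are null.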

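2. groupBy you already know, I guess.

3. rollup and cube are GROUPING SETS operators. rollup is a multi-dimensional aggregation that treats the columns hierarchically.

cube, by contrast, does not treat the columns hierarchically; it does the same thing across all dimensions. You can try grouping_id to see which level of aggregation each row belongs to.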
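
As a rough illustration of what grouping_id reports, here is a plain-Python sketch of the bitmask. It assumes the convention described in the Spark SQL function documentation: one bit per grouping column, first column as the most significant bit, and a bit set to 1 when that column is aggregated away (shown as null) in the row. The helper below is written for this example and is not a Spark function.

```python
def grouping_id(cols, grouping_set):
    """Bitmask with one bit per column: 1 if the column is NOT in the grouping set."""
    gid = 0
    for col in cols:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

cols = ("x", "y")
for grouping_set in [("x", "y"), ("x",), ("y",), ()]:
    print(grouping_set, grouping_id(cols, grouping_set))
# ('x', 'y') 0
# ('x',) 1
# ('y',) 2
# () 3
```

This is also a way to tell a subtotal null apart from a null that was already present in the data, since the two look identical in the output columns.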
