I have a dataset stored in Parquet that contains a few short key fields and a relatively large (a few kB) blob field. The dataset is sorted by key1, key2:
message spark_schema {
optional binary key1 (UTF8);
optional binary key2;
optional binary blob;
}
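For reproducibility, here is a minimal sketch of how a table like this might be set up. The sample DataFrame is my own stand-in; only the HDFS path and the table name t2 come from the actual setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("blob-example").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data: key1 a short string, key2 a short binary, blob a few kB.
val df = Seq(
  ("rare value", Array[Byte](1, 2), Array.fill[Byte](4096)(0)),
  ("common",     Array[Byte](3, 4), Array.fill[Byte](4096)(1))
).toDF("key1", "key2", "blob")

// Sorting before writing clusters each key1 value into few row groups, which is
// what should let row-group statistics skip blob pages for non-matching keys.
df.sort("key1", "key2")
  .write
  .parquet("hdfs://nameservice1/user/me/parquet_test/blob")

spark.read
  .parquet("hdfs://nameservice1/user/me/parquet_test/blob")
  .createOrReplaceTempView("t2")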
One use case for this dataset is to fetch all blobs matching a predicate on key1, key2. I expected Parquet predicate pushdown to help substantially here: row groups where the key predicate matches zero records should never have their blob column read. That does not appear to be the case, however.
For a predicate that returns only 2 rows (out of 6 million), this query:
select sum(length(blob)) from t2 where key1 = 'rare value'
takes 5× longer and reads 50× more data (according to the web UI) than this one:
select sum(length(key2)) from t2 where key1 = 'rare value'
The Parquet scan does appear to receive the predicates (per the explain() output below), and the key columns even appear to be dictionary-encoded (see the parquet-tools output below). So does Parquet filter pushdown actually not let us read less data, or is something wrong in my setup?
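One quick sanity check, assuming a Spark 2.x session named spark: Parquet filter pushdown is governed by the spark.sql.parquet.filterPushdown config flag, which is on by default.

// Pushdown of Parquet filters is controlled by this flag (default: true).
spark.conf.get("spark.sql.parquet.filterPushdown")   // expect "true"

// It can also be toggled explicitly, e.g. to compare scan sizes with and without it:
spark.conf.set("spark.sql.parquet.filterPushdown", false)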
scala> spark.sql("select sum(length(blob)) from t2 where key1 = 'rare value'").explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(cast(length(blob#48) as bigint))])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_sum(cast(length(blob#48) as bigint))])
+- *Project [blob#48]
+- *Filter (isnotnull(key1#46) && (key1#46 = rare value))
+- *BatchedScan parquet [key1#46,blob#48] Format: ParquetFormat, InputPaths: hdfs://nameservice1/user/me/parquet_test/blob, PushedFilters: [IsNotNull(key1), EqualTo(key1,rare value)], ReadSchema: struct<key1:string,blob:binary>
$ parquet-tools meta example.snappy.parquet
creator: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"key1","type":"string","nullable":true,"metadata":{}},{"name":"key2","type":"binary","nullable":true,"metadata":{}},{" [more]...
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
key1: OPTIONAL BINARY O:UTF8 R:0 D:1
key2: OPTIONAL BINARY R:0 D:1
blob: OPTIONAL BINARY R:0 D:1
row group 1: RC:3971 TS:320593029
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
key1: BINARY SNAPPY DO:0 FPO:4 SZ:84/80/0.95 VC:3971 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
key2: BINARY SNAPPY DO:0 FPO:88 SZ:49582/53233/1.07 VC:3971 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
blob: BINARY SNAPPY DO:0 FPO:49670 SZ:134006918/320539716/2.39 VC:3971 ENC:BIT_PACKED,RLE,PLAIN
(repeated for the remaining row groups…)
Someone on the Spark user list pointed me to bugs in Parquet that affect the current Spark versions and end up breaking predicate pushdown for binary and string types. My use case with integer keys worked as expected. Fixed in Parquet 1.8.1; see PARQUET-251 and PARQUET-297 for details. (Credit: Robert Kruszewski.)
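Until the data can be rewritten with a fixed writer, one workaround consistent with the "integer keys work" observation is to filter on an integer surrogate of key1. This is a hedged sketch of my own (reusing df from the earlier sketch): the key1_id column, the crc32-based surrogate, and the output path are illustrations, not part of the original report.

import org.apache.spark.sql.functions._

// Hypothetical workaround: derive an integer surrogate for key1 at write time,
// since pushdown on integer columns is unaffected by PARQUET-251/PARQUET-297.
// crc32 returns a bigint; collisions are possible, so key1 is re-checked after the scan.
val withId = df.withColumn("key1_id", crc32(col("key1")))
withId.sort("key1_id", "key2")
  .write
  .parquet("/tmp/blob_by_int_key")  // hypothetical path

spark.read.parquet("/tmp/blob_by_int_key")
  .where(col("key1_id") === crc32(lit("rare value")) && col("key1") === "rare value")
  .selectExpr("sum(length(blob))")
  .show()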
With Spark 2.1.0 this now works as expected; newly written data shows the following metadata:
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
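Only newly written files pick up the fixed writer, so data written under the old version has to be rewritten. A minimal sketch, where the _v2 output path is my own placeholder:

// Rewriting under Spark 2.1.0 produces files written by parquet-mr 1.8.1,
// which carries the PARQUET-251/PARQUET-297 fixes.
val old = spark.read.parquet("hdfs://nameservice1/user/me/parquet_test/blob")
old.sort("key1", "key2")  // keep the clustering so row-group statistics stay selective
  .write
  .parquet("hdfs://nameservice1/user/me/parquet_test/blob_v2")  // hypothetical path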