Spark: save a DataFrame partitioned by a "virtual" column

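I'm using PySpark to run a classic ETL job (load a dataset, process it, save it), and I want to save my DataFrame as files/directories partitioned by a "virtual" column. By "virtual" I mean that I have a timestamp column, a string containing an ISO 8601 encoded date, and I want to partition by year/month/day. However, I don't actually have "year", "month" or "day" columns in the DataFrame. I have the timestamp, from which I can derive those columns, but I don't want any of them to be serialized in my resulting items.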

The file structure produced by saving the DataFrame to disk should look like this:

/ 
    year=2016/
        month=01/
            day=01/
                part-****.gz

Is there a way to do what I want with Spark/PySpark?

Columns used for partitioning are not included in the serialized data itself. For example, if you create a DataFrame like this:

df = sc.parallelize([
    (1, "foo", 2.0, "2016-02-16"),
    (2, "bar", 3.0, "2016-02-16")
]).toDF(["id", "x", "y", "date"])

and write it as follows:

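import tempfile
from pyspark.sql.functions import col, dayofmonth, month, year
# Path that does not exist yet; save() creates the directory.
outdir = tempfile.mktemp()
# Parse the string column as a date so the date functions can be applied.
dt = col("date").cast("date")
# (function, partition column name) pairs derived from the date.
fname = [(year, "year"), (month, "month"), (dayofmonth, "day")]
# Keep all original columns and append the derived partition columns.
exprs = [col("*")] + [f(dt).alias(name) for f, name in fname]
(df
    .select(*exprs)
    .write
    .partitionBy(*(name for _, name in fname))
    .format("json")
    .save(outdir))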
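Note that year, month and dayofmonth return integers, so the directories written above are named like month=2 rather than month=02 as in the layout sketched in the question. The following is a pure-Python sketch of the two naming schemes, for illustration only; it is not Spark code, and the helper name partition_dir is hypothetical. In Spark itself, deriving string columns with date_format from pyspark.sql.functions could presumably play the role of the zero-padding step, but that is an assumption and not something the code above does.

```python
from datetime import datetime


def partition_dir(timestamp: str, zero_pad: bool = False) -> str:
    """Build a year=/month=/day= directory name from an ISO 8601 timestamp.

    Hypothetical helper, for illustration only.
    """
    # Only the leading YYYY-MM-DD part is needed for the partition values.
    day = datetime.strptime(timestamp[:10], "%Y-%m-%d")
    if zero_pad:
        # String-style values, as in the layout the question asks for.
        return f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    # Integer-style values, as produced by year/month/dayofmonth.
    return f"year={day.year}/month={day.month}/day={day.day}"


print(partition_dir("2016-02-16"))
print(partition_dir("2016-01-01T08:30:00Z", zero_pad=True))
```

The first call mirrors the directory names produced by the write above; the second mirrors the zero-padded layout from the question.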

the individual files do not contain the partition columns:

import os
(sqlContext.read
    .json(os.path.join(outdir, "year=2016/month=2/day=16/"))
    .printSchema())
## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)

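Partitioning data is stored only in the directory structure and is not duplicated in the serialized files. It is attached only when you read a complete or partial directory tree: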

sqlContext.read.json(outdir).printSchema()
## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)
##  |-- year: integer (nullable = true)
##  |-- month: integer (nullable = true)
##  |-- day: integer (nullable = true)
sqlContext.read.json(os.path.join(outdir, "year=2016/month=2/")).printSchema()
## root
##  |-- date: string (nullable = true)
##  |-- id: long (nullable = true)
##  |-- x: string (nullable = true)
##  |-- y: double (nullable = true)
##  |-- day: integer (nullable = true)
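
The behaviour shown above, where only the partition directories below the path being read come back as columns, can be mimicked with a rough pure-Python sketch. This is an illustration of the idea under simplifying assumptions, not Spark's actual partition discovery; the function names and the "out/" paths are hypothetical.

```python
def parse_partition_path(relative_path: str) -> dict:
    """Turn 'year=2016/month=2/day=16/' into {'year': 2016, 'month': 2, 'day': 16}."""
    columns = {}
    for segment in relative_path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a partition directory, e.g. a data file name
        name, _, raw = segment.partition("=")
        # Crude type inference: integers where possible, otherwise strings.
        try:
            columns[name] = int(raw)
        except ValueError:
            columns[name] = raw
    return columns


def partition_columns_seen(read_root: str, file_path: str) -> dict:
    """Columns recovered for file_path when reading starts at read_root."""
    if not file_path.startswith(read_root):
        raise ValueError("file_path must be located under read_root")
    # Only directories below the read root contribute columns.
    return parse_partition_path(file_path[len(read_root):])


data_file = "out/year=2016/month=2/day=16/part-00000"
print(partition_columns_seen("out/", data_file))
print(partition_columns_seen("out/year=2016/month=2/", data_file))
print(partition_columns_seen("out/year=2016/month=2/day=16/", data_file))
```

Reading from the root recovers all three columns, reading from the month directory recovers only day, and reading a leaf directory recovers none, which matches the three schemas printed in the answer.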
