使用列值作为文件名来保存 Spark 数据帧



如何使用列值作为文件名将 Spark 数据帧保存到文件 . 可能吗?

+--------------------------+----------+-----------------+-----------------------------------+
|ID                        |CITY      |DATE             |name                               |
+--------------------------+----------+-----------------+-----------------------------------+
|1                         |          |2011-01-01       |20110101_DATA.snappy.parquet       |
|2                         |          |2011-01-01       |20110101_DATA.snappy.parquet       |
|3                         |          |2011-01-01       |20110101_DATA.snappy.parquet       |
|4                         |Chicago   |2011-01-01       |20110101_DATA.snappy.parquet       |
|5                         |Mansfield |2011-01-02       |20110102_DATA.snappy.parquet       |
|6                         |Pittsburgh|2011-01-02       |20110102_DATA.snappy.parquet       |
|7                         |          |2011-01-02       |20110102_DATA.snappy.parquet       |
|8                         |Clarion   |2011-01-03       |20110103_DATA.snappy.parquet       |
|9                         |Storrs    |2011-01-03       |20110103_DATA.snappy.parquet       |
|10                        |          |2011-01-03       |20110103_DATA.snappy.parquet       |
+--------------------------+----------+-----------------+-----------------------------------+

预期输出:

按日期分区,并在将数据另存为镶木地板时使用名称值作为文件名。o/p 将是 3 个文件

/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet

Spark 无法根据需要在输出拼花地板文件中本机创建自定义名称。可以使用以下代码,但它不是可缩放的解决方案,因为使用.collect()操作。

# In large dataframe maybe it will not work
unique_filename = [row.name for row in df.select('name').distinct().collect()]
for filename in  unique_filenames:
output_filename = "/DATE=" + filename[0:4] + "-" + filename[4:6] + "-" + filename[6:8] + "/" + filename
df.select("ID", "CITY", "DATE") 
.filter(df['name']==filename) 
.write 
.parquet(output_filename)

你会得到你想要的:

/DATE=2011-01-01/20110101_DATA.snappy.parquet
/DATE=2011-01-02/20110102_DATA.snappy.parquet
/DATE=2011-01-03/20110103_DATA.snappy.parquet

最新更新