Spark SQL: NULL values are converted to empty strings in the result file

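I have written a script in AWS Glue that reads a CSV file from AWS S3, applies null checks on a few fields, and stores the result back to S3 as a new file. The problem is that when a field of string type is null, the value is converted to an empty string. I don't want this conversion to happen. For all other data types it works fine.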

Here is the script I have written so far:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
# S3 output directory
output_dir = "s3://aws-glue-scripts/..."
# Data Catalog: database and table name
db_name = "sampledb"
tbl_name = "mytable"
# Read the catalog table and convert it to a Spark DataFrame
datasource = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_name)
datasource_df = datasource.toDF()
# Register a temp view so the null check can be written in SQL
datasource_df.createOrReplaceTempView("myNewTable")
datasource_sql_df = spark.sql("SELECT * FROM myNewTable WHERE name IS NULL")
datasource_sql_df.show()
# Convert back to a DynamicFrame and write the result to S3 as JSON
datasource_sql_dyf = DynamicFrame.fromDF(datasource_sql_df, glueContext, "datasource_sql_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=datasource_sql_dyf,
    connection_type="s3",
    connection_options={"path": output_dir},
    format="json",
)

Can anyone suggest how to get around this problem?

Thanks.

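I don't think this is currently possible. When writing JSON, Spark is configured to ignore null values, and the CSV reader explicitly turns null values into empty strings.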
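
To make the two behaviors described above concrete, here is a minimal sketch in plain Python using only the standard `csv` and `json` modules. It is not Spark or Glue code and does not reproduce their internals; the sample data, the helper names (`read_rows`, `blanks_to_none`, `to_json_line`) and the `drop_nulls` flag are all hypothetical. It only illustrates that CSV has no way to represent null, so a reader hands back an empty string, and that a JSON writer can either keep or drop null fields.

```python
import csv
import io
import json

# Hypothetical sample data: the second row has no value for "name".
CSV_TEXT = "id,name\n1,alice\n2,\n"


def read_rows(text):
    """Parse CSV text; a missing field comes back as '' because CSV has no null."""
    return list(csv.DictReader(io.StringIO(text)))


def blanks_to_none(row, string_fields):
    """Return a copy of row with empty strings in string_fields replaced by None."""
    return {
        key: (None if key in string_fields and value == "" else value)
        for key, value in row.items()
    }


def to_json_line(row, drop_nulls):
    """Serialize a row; drop_nulls=True omits keys whose value is None."""
    if drop_nulls:
        row = {key: value for key, value in row.items() if value is not None}
    return json.dumps(row)


rows = read_rows(CSV_TEXT)
fixed = [blanks_to_none(row, {"name"}) for row in rows]
for row in fixed:
    print(to_json_line(row, drop_nulls=False))
```

In a Spark job the analogous step would be to map empty strings back to null on the string columns before writing. Spark 3.0 and later also expose an `ignoreNullFields` option on the JSON data source; whether the Glue `DynamicFrame` writer honors it is not something this sketch verifies.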
