AWS Glue Python job limits the amount of data written to the S3 bucket



I created a Glue job that reads data from the Glue Data Catalog and saves it to an S3 bucket in Parquet format. It works fine, but the number of items is limited to 20. So every time the job is triggered, only 20 items are saved to the bucket, and I want to save all of them. Maybe I am missing some additional property in the Python script.

Here is the script (generated by AWS):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
transformation_ctx = "datasource0"]
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "cargoprobe_data", table_name = "dev_scv_completed_executions", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [*field list*], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://bucketname"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

This happens automatically in the background and is called partitioning. You can repartition by calling

partitioned_df = dropnullfields3.repartition(1)

which repartitions the DynamicFrame into a single file.
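
For example, the repartitioned frame can then be passed to the existing write step. This is a minimal sketch that reuses the variable names and the placeholder bucket path from the generated script above:

# Coalesce the DynamicFrame into a single partition so the sink writes one output file
partitioned_df = dropnullfields3.repartition(1)
# Write the repartitioned frame instead of dropnullfields3
datasink4 = glueContext.write_dynamic_frame.from_options(frame = partitioned_df, connection_type = "s3", connection_options = {"path": "s3://bucketname"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Note that repartition(1) funnels all the data through a single partition, so for large datasets it trades write parallelism for getting a single output file.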
