How to use dynamodb.output.numParallelTasks in Glue

We use dynamodb.output.numParallelTasks in a Glue Job, but we don't know what it actually does or what its main purpose is.

Has anyone used it?

This is clearly defined in the documentation, "Connection types and options for ETL in AWS Glue - AWS Glue":

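"dynamodb.output.numParallelTasks": (Optional) Defines how many parallel tasks write into DynamoDB at the same time. Used to calculate the permissive WCU per Spark task. If you do not want to control these details, you do not need to specify this parameter.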

permissiveWcuPerTask = TableWCU * dynamodb.throughput.write.percent / dynamodb.output.numParallelTasks
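As a rough illustration of the formula above, here is a minimal Python sketch. The table name, the WCU figure, and the helper name `permissive_wcu_per_task` are hypothetical and chosen only for the example; the option keys are the ones quoted from the documentation. The Glue call is shown only as a comment, since it needs a running Glue job.

```python
def permissive_wcu_per_task(table_wcu: float, write_percent: float,
                            num_parallel_tasks: int) -> float:
    """WCU budget each Spark task may consume, per the documented formula."""
    if num_parallel_tasks <= 0:
        raise ValueError("num_parallel_tasks must be positive")
    return table_wcu * write_percent / num_parallel_tasks


# Hypothetical numbers: a table with 1000 WCU, half of it given to the job,
# shared by 20 parallel writer tasks.
table_wcu = 1000
write_percent = 0.5
num_parallel_tasks = 20

print(permissive_wcu_per_task(table_wcu, write_percent, num_parallel_tasks))  # 25.0

# Connection options as they might be passed to the DynamoDB sink.
# "my_table" is a placeholder table name.
connection_options = {
    "dynamodb.output.tableName": "my_table",
    "dynamodb.throughput.write.percent": str(write_percent),
    "dynamodb.output.numParallelTasks": str(num_parallel_tasks),
}

# Inside a Glue job this dict would typically be used like:
# glueContext.write_dynamic_frame_from_options(
#     frame=dynamic_frame,
#     connection_type="dynamodb",
#     connection_options=connection_options,
# )
```

In other words, setting the parameter does not change how many Spark tasks actually run; it only changes the divisor used to split the table's write budget among them.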

If you do not specify this parameter, the permissive WCU per Spark task is calculated automatically using the following formula:

numPartitions = dynamicframe.getNumPartitions()

numExecutors =

(DPU - 1) * 2 - 1 if WorkerType is Standard

(NumberOfWorkers - 1) if WorkerType is G.1X or G.2X

numSlotsPerExecutor =

4 if WorkerType is Standard

8 if WorkerType is G.1X

16 if WorkerType is G.2X

numSlots = numSlotsPerExecutor * numExecutors

numParallelTasks = min(numPartitions, numSlots)

Example 1. DPU=10, WorkerType=Standard. Input DynamicFrame has 100 RDD partitions.

numPartitions = 100

numExecutors = (10 - 1) * 2 - 1 = 17

numSlots = 4 * 17 = 68

numParallelTasks = min(100, 68) = 68

Example 2. DPU=10, WorkerType=Standard. Input DynamicFrame has 20 RDD partitions.

numPartitions = 20

numExecutors = (10 - 1) * 2 - 1 = 17

numSlots = 4 * 17 = 68

numParallelTasks = min(20, 68) = 20
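The automatic calculation and the two examples above can be reproduced with a small Python sketch. This is only a restatement of the documented formula, not Glue's actual implementation; the function names and the `capacity` argument (DPU for Standard, NumberOfWorkers for G.1X/G.2X) are assumptions made for the example.

```python
# Slots per executor for each worker type, as listed in the formula.
SLOTS_PER_EXECUTOR = {"Standard": 4, "G.1X": 8, "G.2X": 16}


def num_executors(worker_type: str, capacity: int) -> int:
    """capacity is DPU for Standard, NumberOfWorkers for G.1X / G.2X."""
    if worker_type == "Standard":
        return (capacity - 1) * 2 - 1
    if worker_type in ("G.1X", "G.2X"):
        return capacity - 1
    raise ValueError(f"unknown worker type: {worker_type}")


def auto_num_parallel_tasks(num_partitions: int, worker_type: str,
                            capacity: int) -> int:
    """Default numParallelTasks when the option is not specified."""
    executors = num_executors(worker_type, capacity)
    num_slots = SLOTS_PER_EXECUTOR[worker_type] * executors
    return min(num_partitions, num_slots)


# Example 1: DPU=10, Standard, 100 partitions
print(auto_num_parallel_tasks(100, "Standard", 10))  # 68
# Example 2: DPU=10, Standard, 20 partitions
print(auto_num_parallel_tasks(20, "Standard", 10))   # 20
```

So the default is simply the number of tasks that can really run at once: the partition count, capped by the total slots available in the cluster.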
