PySpark AWS Glue error: No space left on device



I am using AWS Glue to extract DynamoDB items into S3. I read all the items with PySpark and Glue, apply a transformation to the items retrieved from DynamoDB, and write the result to S3. However, I keep running into the error "No space left on device".

The worker type I use is G.1X, and each worker maps to 1 DPU (4 vCPUs, 16 GB of memory, 64 GB of disk). The DynamoDB table is about 6 GB in size.

According to the AWS documentation, during a shuffle, data is written to disk and transferred across the network, so the shuffle operation is bound by local disk capacity. How can I configure the shuffle programmatically? Here is my sample code:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map
from awsglue.transforms import Filter
from pyspark import SparkConf
conf = SparkConf()
glue_context = GlueContext(SparkContext.getOrCreate())

# my_table has the attributes id and Uri
resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.4",
        "dynamodb.splits": "8"
    }
)
# Filter out rows whose IDs are the same
def filter_new_id(dynamicRecord):
    uri = dynamicRecord['Uri']
    uri_split = uri.split(":")
    # Get the internal ID
    internal_id = uri_split[1]
    print(dynamicRecord)
    if internal_id == dynamicRecord['id']:
        return False
    return True

# Keep only the items whose IDs are different.
resource_with_old_id = Filter.apply(
    frame=resources_table_dynamic_frame,
    f=lambda x: filter_new_id(x),
    transformation_ctx='resource_with_old_id'
)
glue_context.write_dynamic_frame_from_options(
    frame=resource_with_old_id,
    connection_type="s3",
    connection_options={"path": "s3://path/"},
    format="json"
)

I resolved this by making the following adjustments to the code posted in the OP.

resources_table_dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "0.5",
        "dynamodb.splits": "200"
    },
    additional_options={
        "boundedFiles": "30000"
    }
)

I added boundedFiles as recommended in the AWS documentation and increased dynamodb.splits, which made it work for me.
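
For reference, since the question also asked how to configure the shuffle programmatically: shuffle-related Spark settings can be passed through SparkConf before the SparkContext/GlueContext is created. Below is a minimal sketch; spark.sql.shuffle.partitions is a standard Spark setting, and the value 200 is purely illustrative rather than a tuned recommendation.

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
# Controls how many partitions a shuffle produces, and therefore how much
# data each task spills to the worker's local disk.
conf.set("spark.sql.shuffle.partitions", "200")

glue_context = GlueContext(SparkContext.getOrCreate(conf))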
