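I have an AWS Glue crawler, and to switch the underlying table source I am switching its S3 target path. The problem is that tables get created for both targets.
Configuration: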
aws glue get-crawler --name sand-main
{
    "Crawler": {
        "Name": "sand-main",
        "Role": "Crawler-sand",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://sand-main-green/main",
                    "Exclusions": [
                        "checkpoints/**",
                        "IsActive.txt",
                        "isactive.txt"
                    ]
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": []
        },
        "DatabaseName": "sand_main",
        "Description": "",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DELETE_FROM_DATABASE"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "CrawlElapsedTime": 0,
        "CreationTime": "2020-09-30T14:07:25-06:00",
        "LastUpdated": "2021-01-28T11:32:15-07:00",
        "LastCrawl": {
            "Status": "SUCCEEDED",
            "LogGroup": "/aws-glue/crawlers",
            "LogStream": "sand-main",
            "MessagePrefix": "5bb1907d-2847-46ef-8712-3a50deb2b7a0",
            "StartTime": "2021-01-28T11:32:35-07:00"
        },
        "Version": 24,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
    }
}
I have a Lambda that switches the path from:
"Path": "s3://sand-main-green/main"
to:
"Path": "s3://sand-main-blue/main"
But I end up with these tables:
Name → Location
test → s3://sand-main-blue/main/test
test_2398l50df → s3://sand-main-green/main/test
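I have DeleteBehavior set to DELETE_FROM_DATABASE, so I expected the table for the old S3 path to be deleted. It feels like the crawler keeps a history of its S3 targets. I do not want this behavior.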
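Normally the crawler names a table after the last part of the file path ("test" in your example). If a table with that name already exists in the database, it creates a new table with random characters appended as a suffix (test_2398l50df in your example).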
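To check which table currently points at which bucket, something like the following might help. It is only a sketch: `tables_by_location` and `stale_tables` are hypothetical helper names, and `glue` is assumed to be a boto3 Glue client created with `boto3.client("glue")`.

```python
def tables_by_location(glue, database):
    """Return {table_name: s3_location} for every table in a Glue database."""
    locations = {}
    # get_tables is paginated, so walk every page of results.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            # StorageDescriptor may be missing on some tables, so use .get().
            location = table.get("StorageDescriptor", {}).get("Location", "")
            locations[table["Name"]] = location
    return locations


def stale_tables(locations, active_prefix):
    """Return the sorted names of tables stored outside the active S3 prefix."""
    prefix = active_prefix.rstrip("/") + "/"
    return sorted(
        name
        for name, location in locations.items()
        if not (location.rstrip("/") + "/").startswith(prefix)
    )
```

With the tables from the question, `stale_tables(tables_by_location(glue, "sand_main"), "s3://sand-main-blue/main")` should report the suffixed table that still points at the green bucket.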
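If you want the "test" table to point to the new path, perform these steps in order:
- Run the crawler with the location s3://sand-main-blue/main/test (this creates the "test" table)
- Delete the "test" table from the database
- Update the crawler with the new path (s3://sand-main-green/main/test)
- Run the crawler (this creates the "test" table)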
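The delete, update and re-run steps could be scripted from the Lambda roughly as below. This is a sketch under assumptions, not a tested implementation: `switch_crawler_path` and `wait_until_ready` are hypothetical names, `glue` is assumed to be a boto3 Glue client (`boto3.client("glue")`), and the polling interval is arbitrary.

```python
import time


def wait_until_ready(glue, crawler_name, poll_seconds=15, max_polls=120):
    """Poll get_crawler until the crawler state is READY."""
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {crawler_name} did not become READY")


def switch_crawler_path(glue, crawler_name, database, table_names, new_path,
                        exclusions=(), poll_seconds=15):
    """Drop the old tables, repoint the crawler at new_path, then crawl again."""
    # A crawler cannot be updated while it is running, so wait first.
    wait_until_ready(glue, crawler_name, poll_seconds)

    # Delete the existing tables so their names are free again.
    # Note: delete_table raises an error if a table does not exist.
    for name in table_names:
        glue.delete_table(DatabaseName=database, Name=name)

    # Replace the crawler's targets with the single new S3 path.
    glue.update_crawler(
        Name=crawler_name,
        Targets={"S3Targets": [{"Path": new_path, "Exclusions": list(exclusions)}]},
    )

    # Run the crawler; it should recreate the tables from the new path.
    glue.start_crawler(Name=crawler_name)
    wait_until_ready(glue, crawler_name, poll_seconds)
```

For the example above, the call would be `switch_crawler_path(glue, "sand-main", "sand_main", ["test"], "s3://sand-main-green/main/test", exclusions=["checkpoints/**", "IsActive.txt", "isactive.txt"])`. Deleting before the crawl matters: if the old "test" table is still present when the crawler runs, it falls back to a suffixed name.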