AWS Crawler S3 target path changed, but tables from the old path are still included



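I have an AWS crawler, and to switch the underlying table source I am switching its S3 target path. The problem is that tables get created for both targets:

Configuration: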

aws glue get-crawler --name sand-main 
{
"Crawler": {
"Name": "sand-main",
"Role": "Crawler-sand",
"Targets": {
"S3Targets": [
{
"Path": "s3://sand-main-green/main",
"Exclusions": [
"checkpoints/**",
"IsActive.txt",
"isactive.txt"
]
}
],
"JdbcTargets": [],
"MongoDBTargets": [],
"DynamoDBTargets": [],
"CatalogTargets": []
},
"DatabaseName": "sand_main",
"Description": "",
"Classifiers": [],
"RecrawlPolicy": {
"RecrawlBehavior": "CRAWL_EVERYTHING"
},
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
},
"LineageConfiguration": {
"CrawlerLineageSettings": "DISABLE"
},
"State": "READY",
"CrawlElapsedTime": 0,
"CreationTime": "2020-09-30T14:07:25-06:00",
"LastUpdated": "2021-01-28T11:32:15-07:00",
"LastCrawl": {
"Status": "SUCCEEDED",
"LogGroup": "/aws-glue/crawlers",
"LogStream": "sand-main",
"MessagePrefix": "5bb1907d-2847-46ef-8712-3a50deb2b7a0",
"StartTime": "2021-01-28T11:32:35-07:00"
},
"Version": 24,
"Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
}

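I have a Lambda that switches the path from "Path": "s3://sand-main-green/main" to "Path": "s3://sand-main-blue/main"

But I end up with these tables:

Name → Location

test → s3://sand-main-blue/main/test

test_2398l50df → s3://sand-main-green/main/test

I have DELETE_FROM_DATABASE set, so I expected the table for the old S3 path to be deleted. It feels as if the crawler keeps a history of its S3 targets. I do not want this behavior.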

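A crawler normally creates a table named after the last part of the file path ("test" in your example). If a table with that name already exists in the database, it creates a new table with random characters appended as a suffix (test_2398l50df in your example).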

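If you want the "test" table to point to the new path, perform the steps in this order:

  1. Run the crawler with the S3 location s3://sand-main-blue/main/test (this creates the "test" table)
  2. Delete the "test" table in the database
  3. Update the crawler with the new path (s3://sand-main-green/main/test)
  4. Run the crawler (this creates the "test" table)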
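
Since the question mentions a Lambda doing the switch, the delete, update, and run steps can be sketched in Python with boto3. This is only a rough sketch under some assumptions: the function name switch_crawler_path is made up for illustration, it assumes the crawler has S3 targets only and is in the READY state, and it drops every table in the crawler's database whose location sits under the old target path (not just "test"). The glue argument is expected to be a boto3 Glue client; get_crawler, get_tables, delete_table, update_crawler, and start_crawler are the boto3 Glue calls used.

```python
def switch_crawler_path(glue, crawler_name, new_path):
    """Point a Glue crawler at new_path after dropping tables from the old path.

    glue is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    Returns the names of the tables that were deleted.
    """
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
    database = crawler["DatabaseName"]
    old_targets = crawler["Targets"]["S3Targets"]
    # Normalise to a trailing slash so "main" does not match "main2".
    old_prefixes = [t["Path"].rstrip("/") + "/" for t in old_targets]

    # Collect the tables whose location lies under an old target path.
    to_drop = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "")
            normalised = location.rstrip("/") + "/"
            if any(normalised.startswith(p) for p in old_prefixes):
                to_drop.append(table["Name"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # Delete the old tables so the crawler can reuse their names.
    for name in to_drop:
        glue.delete_table(DatabaseName=database, Name=name)

    # Assumption: a single S3 target, keeping the first target's other settings
    # (such as Exclusions) and replacing only the path.
    new_target = dict(old_targets[0])
    new_target["Path"] = new_path
    glue.update_crawler(Name=crawler_name, Targets={"S3Targets": [new_target]})

    # Re-crawl so the tables are created again from the new path.
    glue.start_crawler(Name=crawler_name)
    return to_drop


# Example use inside a Lambda handler (requires boto3 and AWS credentials):
#   import boto3
#   switch_crawler_path(boto3.client("glue"), "sand-main", "s3://sand-main-blue/main")
```

Deleting the old tables before the crawl is what keeps the crawler from appending a random suffix, because no table named "test" exists when the new path is crawled.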
