List all S3 files that AWS Glue parses from a table, using the AWS Python SDK boto3



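I tried to find a way to do this in the Glue API documentation, but the responses of get_table(**kwargs) and get_tables(**kwargs) expose no attribute or method for it.

I imagine something like the following (pseudo)code: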

import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...

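As far as I can see from the documentation, each table in response["TableList"] should be a dictionary; however, there seems to be no key that gives access to the files stored in the table.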
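Although the table dictionary carries no file list, the Glue API does record where the table's data lives, under table["StorageDescriptor"]["Location"]. Below is a minimal sketch of reading that key; the helper name `table_location` and the sample dictionary are made up for illustration and do not come from the Glue API itself.

```python
from typing import Optional


def table_location(table: dict) -> Optional[str]:
    """Return the S3 location recorded for a Glue table dict, if any."""
    # Use .get() so a table without a StorageDescriptor or Location yields None.
    return table.get("StorageDescriptor", {}).get("Location")


# Hypothetical entry shaped like one item of response["TableList"].
sample_table = {
    "Name": "example_table",
    "StorageDescriptor": {"Location": "s3://example-bucket/example/prefix/"},
}
print(table_location(sample_table))
```

The location is only a prefix, so the objects stored under it still have to be listed separately.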

One way to solve this is to use awswrangler.

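The function below goes through every AWS Glue table in a database and compares the files under each table's S3 location with a given list of recently uploaded files. As soon as one file path matches, it yields the corresponding table dictionary, so the tables it yields are the ones that were recently updated.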
from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # Look up the S3 location that backs this table.
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # describe_objects returns a dict keyed by object path.
            s3_filepaths = list(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            table_was_updated = False
            for upload_file in upload_path_list:
                if upload_file in s3_filepaths:
                    table_was_updated = True
                    break
            if table_was_updated:
                yield table_dict
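If adding awswrangler as a dependency is not an option, the same file list can probably be obtained with boto3 alone by listing the objects under the table's S3 location (for example the value of table["StorageDescriptor"]["Location"]). The sketch below is only an illustration under that assumption: the helper names `split_s3_uri` and `iter_s3_filepaths` are hypothetical, and the S3 client is passed in as an argument so the function can be exercised without AWS credentials.

```python
from typing import Iterator, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def iter_s3_filepaths(s3_client, location: str) -> Iterator[str]:
    """Yield the full s3:// path of every object stored under location."""
    bucket, prefix = split_s3_uri(location)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield f"s3://{bucket}/{obj['Key']}"


# Usage (needs AWS credentials; bucket and prefix are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   for path in iter_s3_filepaths(s3, "s3://example-bucket/example/prefix/"):
#       print(path)
```

Note that S3 prefix matching is purely textual: a location without a trailing slash, such as s3://example-bucket/data, would also match keys under data2/, so it may be worth normalizing the location before listing.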
