Python, Textract, Boto3: trying to pass the result of a method call as an argument to the same method, then loop



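I have a multi-page PDF on AWS S3 and I am using Textract to extract all of its text. The responses come back in batches: the first response gives me a "NextToken", which I need to pass as an argument to the get_document_analysis method to get the next batch.

How can I avoid running get_document_analysis by hand each time, pasting in the NextToken value I received from the previous run?

Here is an attempt: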

import boto3

client = boto3.client('textract')

# Get my JobId
my_job_id_ref = client.start_document_text_detection(DocumentLocation={'S3Object': {'Bucket': 'myawsbucket', 'Name': 'mymuli-page-pdf-file.pdf'}})['JobId']

def my_output():
    my_ls = []

    # I need to repeat the following call until the break condition further below
    while True:

        # This returns a dictionary with a key named NextToken, whose value needs to be passed as an arg to the next call
        x = client.get_document_analysis(JobId=my_job_id_ref)

        # Assigning the value of NextToken to a variable
        next_token = x['NextToken']

        # Running the call again, this time with next_token passed as an argument
        x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)

        # Need to repeat the call until there is no token. The token is normally a string, hence
        if len(next_token) < 1:
            break

        my_ls.append(x)

    return my_ls

The trick is to use a while condition that checks whether NextToken is present.

my_ls = []

# Get the analysis once to see if there is a need to loop in the first place
x = client.get_document_analysis(JobId=my_job_id_ref)
next_token = x.get('NextToken')
my_ls.append(x)

# Now repeat until we have the last page, passing the token from the previous response each time
while next_token is not None:
    x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)
    next_token = x.get('NextToken')
    my_ls.append(x)

The value of next_token is overwritten on every iteration until it is None, at which point we leave the loop.
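The same loop can be wrapped in a small reusable helper. This is only a sketch: collect_pages and make_textract_fetcher are hypothetical names that do not come from boto3, and the code assumes the client exposes get_document_analysis with JobId and NextToken keyword arguments, as used in the question.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def collect_pages(fetch_page: Callable[[Optional[str]], Page]) -> List[Page]:
    """Call fetch_page repeatedly, feeding each NextToken into the next call.

    fetch_page is a hypothetical callable: it receives the previous token
    (None on the first call) and returns one response dictionary.
    """
    pages: List[Page] = []
    next_token: Optional[str] = None
    while True:
        page = fetch_page(next_token)
        pages.append(page)
        # .get returns None when the key is absent, i.e. on the last page
        next_token = page.get("NextToken")
        if not next_token:
            break
    return pages


def make_textract_fetcher(client: Any, job_id: str) -> Callable[[Optional[str]], Page]:
    """Build a fetch_page callable for a Textract-style client (assumed interface)."""
    def fetch_page(next_token: Optional[str]) -> Page:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        # Only send NextToken when we actually have one
        if next_token:
            kwargs["NextToken"] = next_token
        return client.get_document_analysis(**kwargs)
    return fetch_page
```

With the names from the question, usage would look like my_ls = collect_pages(make_textract_fetcher(client, my_job_id_ref)). Keeping the loop separate from the API call makes it possible to exercise the pagination logic with a stand-in client, without calling AWS.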

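Note that I use x.get(..) to check whether the response dictionary contains NextToken. It may not be set at all, even in the first response, in which case .get(..) simply returns None. (If NextToken is not set, x["NextToken"] raises a KeyError.)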
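The difference between the two lookups can be seen on plain dictionaries. The two response dictionaries below are made-up stand-ins for illustration, not real Textract output, and read_token and read_token_strict are hypothetical helper names.

```python
# Made-up stand-ins for a middle page and a last page of results
middle_page = {"Blocks": [], "NextToken": "abc123"}
last_page = {"Blocks": []}


def read_token(response):
    """Return the token, or None when the key is missing."""
    return response.get("NextToken")


def read_token_strict(response):
    """Same result using [] indexing, which needs explicit KeyError handling."""
    try:
        return response["NextToken"]
    except KeyError:
        return None


print(read_token(middle_page))
print(read_token(last_page))
```

Both helpers return the same values; .get(..) just avoids the try/except.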
