Reading multiple CSV files from S3 using Python and Boto3

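I am able to read multiple csv files from an S3 bucket with boto3 in Python and finally combine them into a single pandas dataframe. However, some folders contain empty files, which causes the error "No columns to parse from file". Can we skip the empty files in the code below?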

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('testbucket')
prefix_objs = bucket.objects.filter(Prefix="extracted/abc")
prefix_df = []  # one dataframe per object
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    # Raises "No columns to parse from file" when body is empty
    temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
    prefix_df.append(temp)

I used this answer: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3

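import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('testbucket')
prefix_objs = bucket.objects.filter(Prefix="extracted/abc")
prefix_df = []  # one dataframe per non-empty object
for obj in prefix_objs:
    try:
        key = obj.key
        body = obj.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
        prefix_df.append(temp)
    except pd.errors.EmptyDataError:
        # Empty body ("No columns to parse from file"): skip this object
        continue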
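
The question also mentions combining the files into a single dataframe. A minimal sketch of that step is shown here, assuming the object bodies have already been downloaded as bytes; the function name combine_csv_bodies is hypothetical and not part of boto3 or pandas.

```python
import io
from typing import Iterable

import pandas as pd


def combine_csv_bodies(bodies: Iterable[bytes]) -> pd.DataFrame:
    """Parse each CSV body and stack the results, skipping empty bodies.

    Hypothetical helper: `bodies` stands in for the bytes returned by
    obj.get()['Body'].read() in the loop above.
    """
    frames = []
    for body in bodies:
        try:
            frames.append(
                pd.read_csv(io.BytesIO(body), header=None, encoding="utf8", sep=",")
            )
        except pd.errors.EmptyDataError:
            # Empty body: nothing to parse, so skip it
            continue
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With the loop above, this would be called as combine_csv_bodies(obj.get()['Body'].read() for obj in prefix_objs).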

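Using the same code, I got the same error as the OP. When I ran the code below to print the names of all objects in the folder (inputfiles) of the bucket (testbucket), three keys were listed even though I have only two objects. The last two keys are the text files inside the folder, which are the ones I am interested in, while the first key refers to the folder that contains the two csv files.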

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('testbucket')
# Prints every key under the prefix, including the folder key itself
for file in my_bucket.objects.filter(Prefix="inputfiles/"):
    print(file.key)

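The cause of the error "No columns to parse from file" is that the for loop tries to parse the folder key, which has no content in its 'Body'. With a try/except block as below, the code runs as expected and prints the name of the key that caused the error.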

import io

import pandas as pd

# my_bucket is the s3.Bucket('testbucket') resource created above
for file in my_bucket.objects.filter(Prefix="inputfiles/"):
    try:
        body = file.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), encoding='utf8', sep=',')
        print(temp.head())  # you may print or append the data to a dataframe
    except pd.errors.EmptyDataError:
        print(file.key)  # the key that has no columns to parse
        continue
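
An alternative to catching the exception is to filter out the folder key and any zero-byte objects before downloading them. The sketch below assumes each listed object exposes key and size attributes, as the object summaries returned by bucket.objects.filter do; the helper name non_empty_objects is hypothetical.

```python
from typing import Any, Iterable, Iterator


def non_empty_objects(objs: Iterable[Any]) -> Iterator[Any]:
    """Yield only objects that can hold CSV data.

    Hypothetical helper: assumes every item has `.key` (str) and
    `.size` (int, bytes), like boto3's S3 object summaries.
    """
    for obj in objs:
        # Keys ending in "/" are folder placeholders; size 0 means an empty file
        if obj.key.endswith("/") or obj.size == 0:
            continue
        yield obj
```

It could then wrap the listing, for example: for file in non_empty_objects(my_bucket.objects.filter(Prefix="inputfiles/")). This avoids a download for every empty object, but a file that contains only blank lines would still need the try/except above.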
