Concatenating dataframes read from multiple AWS S3 buckets raises a NoneType error



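I have a bunch of CSV files that I need to read and combine into a single dataframe. The files live in an AWS S3 bucket and its subfolders. Reading them is not really the problem. The problem is that I get an error message while doing it, and I cannot figure out what I am doing wrong: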

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')  # assumed: `s3` is a boto3 resource, as s3.Bucket implies
bucket = s3.Bucket('mybucket')
prefix_objs = bucket.objects.filter(Prefix="folder/file_prefix")
df = pd.DataFrame()
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), sep=",", encoding='utf8')
    frame = [df, temp]
    df = pd.concat(frame)

So the idea is to start from an empty df and call read_csv only on the files in the given bucket folder whose keys start with the given prefix.

Now I get the error

AttributeError: 'NoneType' object has no attribute 'items'

At the same time, I do get some output, which suggests I am not far from the goal:

           Id        date  success      on     off  quota  errors
0         130  2020-12-09     True     0.0     0.0  1.000     NaN
1         433  2020-12-09    False     NaN     NaN    NaN     NaN
2         810  2020-12-09     True     0.0     0.0  1.000     NaN
3        2889  2020-12-09     True  1653.0  1707.0  0.968     NaN
4        5410  2020-12-09    False     NaN     NaN    NaN     NaN
..        ...         ...      ...     ...     ...    ...     ...
2         810  2021-01-12     True    50.0    47.0  1.064     NaN
3        2889  2021-01-12     True   190.0   179.0  1.061     NaN
4        5410  2021-01-12     True     0.0     0.0  1.000     NaN
5        6069  2021-01-12     True  1736.0  1779.0  0.976     NaN
6        6128  2021-01-12     True     0.0     0.0  1.000     NaN
[232 rows x 7 columns]

What in my code is going wrong so that it produces this error? Any help would be appreciated.
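One way to narrow this down is to find out which object triggers the exception. The sketch below is a diagnostic under stated assumptions, not a confirmed fix: it skips keys that do not end in `.csv` (listing by prefix can also return "folder" placeholder keys), reads each remaining object on its own, and records the key of anything that fails. The helper `read_csv_objects` and the `FakeObject` stand-in are hypothetical names used only for illustration; with boto3 you would pass `bucket.objects.filter(Prefix=...)` instead of the fake list.

```python
import io

import pandas as pd


def read_csv_objects(objects):
    """Read every *.csv object into a DataFrame; return (frames, failures)."""
    frames, failures = [], []
    for obj in objects:
        # Skip anything that is not a CSV file, e.g. a "folder/" placeholder key.
        if not obj.key.endswith(".csv"):
            continue
        try:
            body = obj.get()["Body"].read()
            frames.append(pd.read_csv(io.BytesIO(body), sep=",", encoding="utf8"))
        except Exception as exc:
            # Remember which key failed and why, then keep going.
            failures.append((obj.key, repr(exc)))
    return frames, failures


class FakeObject:
    """Hypothetical stand-in exposing only .key and .get(), like an S3 object summary."""

    def __init__(self, key, data):
        self.key = key
        self._data = data

    def get(self):
        return {"Body": io.BytesIO(self._data)}


objects = [
    FakeObject("folder/", b""),  # placeholder key, skipped
    FakeObject("folder/file_prefix_1.csv", b"Id,date,quota\n130,2020-12-09,1.000\n"),
    FakeObject("folder/file_prefix_2.csv", b""),  # empty file, read_csv raises
]
frames, failures = read_csv_objects(objects)
for key, error in failures:
    print("failed:", key, error)
```

If `failures` comes back non-empty for the real bucket, the listed keys show which files to inspect by hand.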

Alternatively, if this question cannot be answered:

If I change the code to

import io

bucket = s3.Bucket('my bucket')
prefix_objs = bucket.objects.filter(Prefix="folder/prefix")
df = []  # a plain list that collects one dataframe per file
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8', sep=",")
    df.append(temp)

how can I convert

[          Id        date  success      on     off  quota  errors
0        130  2020-12-09     True     0.0     0.0  1.000     NaN
1        433  2020-12-09    False     NaN     NaN    NaN     NaN
2        810  2020-12-09     True     0.0     0.0  1.000     NaN
3       2889  2020-12-09     True  1653.0  1707.0  0.968     NaN
4       5410  2020-12-09    False     NaN     NaN    NaN     NaN
5       6069  2020-12-09     True     0.0     0.0  1.000     NaN
6       6128  2020-12-09     True  2202.0  2182.0  1.009     NaN,
          Id        date  success      on     off  quota  errors
0        130  2020-12-10     True   634.0   556.0  1.140     NaN
1        433  2020-12-10    False     NaN     NaN    NaN     NaN
2        810  2020-12-10     True   464.0   442.0  1.050     NaN
3       2889  2020-12-10     True   940.0   915.0  1.027     NaN
4       5410  2020-12-10    False     NaN     NaN    NaN     NaN
5       6069  2020-12-10     True  2926.0  2879.0  1.016     NaN
6       6128  2020-12-10     True    32.0    32.0  1.000     NaN,
          Id        date  success      on     off  quota  errors
0        130  2020-12-11     True   366.0   341.0  1.073     NaN
1        433  2020-12-11    False     NaN     NaN    NaN     NaN
2        810  2020-12-11     True   204.0   201.0  1.015     NaN
3       2889  2020-12-11     True   359.0   362.0  0.992     NaN
4       5410  2020-12-11    False     NaN     NaN    NaN     NaN
5       6069  2020-12-11     True  1601.0  1588.0  1.008     NaN
6       6128  2020-12-11     True   703.0   705.0  0.997     NaN,
          Id        date  success     on    off  quota  errors
0        130  2020-12-12     True  162.0  153.0  1.059     NaN
1        433  2020-12-12    False    NaN    NaN    NaN     NaN
2        810  2020-12-12     True  153.0  147.0  1.041     NaN
3       2889  2020-12-12    False    NaN    NaN    NaN     NaN
4       5410  2020-12-12    False    NaN    NaN    NaN     NaN
5       6069  2020-12-12     True  690.0  701.0  0.984     NaN
6       6128  2020-12-12     True    0.0    0.0  1.000     NaN]

into a dataframe? I tried DF = pd.DataFrame(df), but that is obviously wrong.
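For reference, `pd.DataFrame` is a constructor rather than a concatenation function, whereas `pd.concat` accepts a list of dataframes directly. The following is a minimal sketch using two small in-memory CSV strings (made-up stand-ins for the S3 files, taken from the rows shown in this post) in place of the list built in the loop:

```python
import io

import pandas as pd

# Two tiny CSV payloads standing in for the files read from the bucket.
csv_a = (
    "Id,date,success,on,off,quota,errors\n"
    "130,2020-12-09,True,0.0,0.0,1.000,\n"
    "433,2020-12-09,False,,,,\n"
)
csv_b = (
    "Id,date,success,on,off,quota,errors\n"
    "130,2020-12-10,True,634.0,556.0,1.140,\n"
    "433,2020-12-10,False,,,,\n"
)

# Build the list of dataframes, as the loop over the bucket does.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]

# Stack them row-wise; ignore_index=True renumbers the rows 0..n-1
# instead of repeating each file's own 0-based index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list and concatenating once at the end also avoids repeatedly copying a growing dataframe inside the loop.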

Edit: reproducible data

All the CSV files in the bucket have this form:

Id,date,success,on,off,quota,errors
130,2020-12-09,True,0.0,0.0,1.000,
433,2020-12-09,False,,,,
810,2020-12-09,True,0.0,0.0,1.000,
2889,2020-12-09,True,1653.0,1707.0,0.968,
5410,2020-12-09,False,,,,
6069,2020-12-09,True,0.0,0.0,1.000,
6128,2020-12-09,True,2202.0,2182.0,1.009,

Here is a second example:

Id,date,success,on,off,quota,errors
130,2020-12-11,True,366.0,341.0,1.073,
433,2020-12-11,False,,,,
810,2020-12-11,True,204.0,201.0,1.015,
2889,2020-12-11,True,359.0,362.0,0.992,
5410,2020-12-11,False,,,,
6069,2020-12-11,True,1601.0,1588.0,1.008,
6128,2020-12-11,True,703.0,705.0,0.997,

All missing values are left empty.
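As a quick check of how pandas treats those empty fields, here is a small sketch that parses two of the rows above and counts the missing values per column. The `parse_dates` argument is an optional addition that is not part of the original code.

```python
import io

import pandas as pd

sample = """Id,date,success,on,off,quota,errors
130,2020-12-09,True,0.0,0.0,1.000,
433,2020-12-09,False,,,,
"""

# Empty fields are read as NaN; parse_dates turns the date column into datetimes.
df = pd.read_csv(io.StringIO(sample), parse_dates=["date"])

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```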

The following code works fine.

import io
import pandas as pd
csv1 ="""Id,date,success,on,off,quota,errors
130,2020-12-09,True,0.0,0.0,1.000,
433,2020-12-09,False,,,,
810,2020-12-09,True,0.0,0.0,1.000,
2889,2020-12-09,True,1653.0,1707.0,0.968,
5410,2020-12-09,False,,,,
6069,2020-12-09,True,0.0,0.0,1.000,
6128,2020-12-09,True,2202.0,2182.0,1.009,"""
csv2="""Id,date,success,on,off,quota,errors
130,2020-12-11,True,366.0,341.0,1.073,
433,2020-12-11,False,,,,
810,2020-12-11,True,204.0,201.0,1.015,
2889,2020-12-11,True,359.0,362.0,0.992,
5410,2020-12-11,False,,,,
6069,2020-12-11,True,1601.0,1588.0,1.008,
6128,2020-12-11,True,703.0,705.0,0.997,"""
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))
df = pd.concat([df1, df2])
df.to_csv('aws_toy.csv')
print(df, '\n')
avg_quota = df.groupby('date').agg(avg_quota=('quota', 'mean')).reset_index()
print(avg_quota, '\n')
select_quota = df.filter(['on'])
print(select_quota, '\n')

Output:

     Id        date  success      on     off  quota  errors
0   130  2020-12-09     True     0.0     0.0  1.000     NaN
1   433  2020-12-09    False     NaN     NaN    NaN     NaN
2   810  2020-12-09     True     0.0     0.0  1.000     NaN
3  2889  2020-12-09     True  1653.0  1707.0  0.968     NaN
4  5410  2020-12-09    False     NaN     NaN    NaN     NaN
5  6069  2020-12-09     True     0.0     0.0  1.000     NaN
6  6128  2020-12-09     True  2202.0  2182.0  1.009     NaN
0   130  2020-12-11     True   366.0   341.0  1.073     NaN
1   433  2020-12-11    False     NaN     NaN    NaN     NaN
2   810  2020-12-11     True   204.0   201.0  1.015     NaN
3  2889  2020-12-11     True   359.0   362.0  0.992     NaN
4  5410  2020-12-11    False     NaN     NaN    NaN     NaN
5  6069  2020-12-11     True  1601.0  1588.0  1.008     NaN
6  6128  2020-12-11     True   703.0   705.0  0.997     NaN

         date  avg_quota
0  2020-12-09     0.9954
1  2020-12-11     1.0170

       on
0     0.0
1     NaN
2     0.0
3  1653.0
4     NaN
5     0.0
6  2202.0
0   366.0
1     NaN
2   204.0
3   359.0
4     NaN
5  1601.0
6   703.0
