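I have a bunch of CSV files that I need to read and combine into a single DataFrame. The files live in an AWS S3 bucket and its subfolders. Reading the files is not really the problem. The problem is that I get an error message while doing it, and I cannot figure out what I am doing wrong: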
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')  # assumed: s3 is a boto3 S3 resource
bucket = s3.Bucket('mybucket')
prefix_objs = bucket.objects.filter(Prefix="folder/file_prefix")
df = pd.DataFrame()
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), sep=",", encoding='utf8')
    frame = [df, temp]
    df = pd.concat(frame)
So my idea is to start with an empty df and call read_csv only on the files in the given bucket folder that have the given prefix.
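The selection step described above can be sketched without touching S3 at all. This is only an illustration under stated assumptions: `read_matching_csvs` and `fake_objects` are hypothetical names, the keys are made up, and the `(key, bytes)` pairs stand in for what `(obj.key, obj.get()['Body'].read())` would yield in the loop above. The extra `.csv` suffix check is an assumption on my part, added in case the prefix also matches keys that are not CSV files.

```python
import io

import pandas as pd


def read_matching_csvs(objects, prefix):
    """Parse every (key, body) pair whose key starts with `prefix` and ends in .csv.

    `objects` is a hypothetical iterable of (key, bytes) pairs standing in
    for the S3 objects returned by the bucket listing.
    """
    frames = []
    for key, body in objects:
        # Skip anything that is not a CSV under the wanted prefix,
        # for example zero-byte "folder" marker keys.
        if not key.startswith(prefix) or not key.endswith(".csv"):
            continue
        frames.append(pd.read_csv(io.BytesIO(body), sep=",", encoding="utf8"))
    return frames


# In-memory stand-ins for S3 objects (made-up keys, rows cut down from the sample data).
fake_objects = [
    ("folder/file_prefix_2020-12-09.csv", b"Id,date,success\n130,2020-12-09,True\n"),
    ("folder/file_prefix_2020-12-11.csv", b"Id,date,success\n130,2020-12-11,True\n"),
    ("folder/other.txt", b"not a csv"),
    ("folder/", b""),
]

frames = read_matching_csvs(fake_objects, "folder/file_prefix")
print(len(frames))
```

Collecting the parsed frames in a list keeps the reading step separate from the combining step, so each one can be checked on its own.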
Now, I get the error
AttributeError: 'NoneType' object has no attribute 'items'
At the same time, though, I do get some output that suggests I am not far from the goal:
id date success on off quota errors
0 130 2020-12-09 True 0.0 0.0 1.000 NaN
1 433 2020-12-09 False NaN NaN NaN NaN
2 810 2020-12-09 True 0.0 0.0 1.000 NaN
3 2889 2020-12-09 True 1653.0 1707.0 0.968 NaN
4 5410 2020-12-09 False NaN NaN NaN NaN
.. ... ... ... ... ... ... ...
2 810 2021-01-12 True 50.0 47.0 1.064 NaN
3 2889 2021-01-12 True 190.0 179.0 1.061 NaN
4 5410 2021-01-12 True 0.0 0.0 1.000 NaN
5 6069 2021-01-12 True 1736.0 1779.0 0.976 NaN
6 6128 2021-01-12 True 0.0 0.0 1.000 NaN
[232 rows x 7 columns]
What in my code is wrong so that it produces that error? Any help would be appreciated.
An alternative, in case that question cannot be answered:
If I change the code to
import io

bucket = s3.Bucket('mybucket')
prefix_objs = bucket.objects.filter(Prefix="folder/prefix")
df = []  # despite the name, a list holding one DataFrame per file
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8', sep=",")
    df.append(temp)
how do I convert
[ Id date success on off quota errors
0 130 2020-12-09 True 0.0 0.0 1.000 NaN
1 433 2020-12-09 False NaN NaN NaN NaN
2 810 2020-12-09 True 0.0 0.0 1.000 NaN
3 2889 2020-12-09 True 1653.0 1707.0 0.968 NaN
4 5410 2020-12-09 False NaN NaN NaN NaN
5 6069 2020-12-09 True 0.0 0.0 1.000 NaN
6 6128 2020-12-09 True 2202.0 2182.0 1.009 NaN,
id date success on off quota errors
0 130 2020-12-10 True 634.0 556.0 1.140 NaN
1 433 2020-12-10 False NaN NaN NaN NaN
2 810 2020-12-10 True 464.0 442.0 1.050 NaN
3 2889 2020-12-10 True 940.0 915.0 1.027 NaN
4 5410 2020-12-10 False NaN NaN NaN NaN
5 6069 2020-12-10 True 2926.0 2879.0 1.016 NaN
6 6128 2020-12-10 True 32.0 32.0 1.000 NaN,
id date success on off quota errors
0 130 2020-12-11 True 366.0 341.0 1.073 NaN
1 433 2020-12-11 False NaN NaN NaN NaN
2 810 2020-12-11 True 204.0 201.0 1.015 NaN
3 2889 2020-12-11 True 359.0 362.0 0.992 NaN
4 5410 2020-12-11 False NaN NaN NaN NaN
5 6069 2020-12-11 True 1601.0 1588.0 1.008 NaN
6 6128 2020-12-11 True 703.0 705.0 0.997 NaN,
id date success on off quota errors
0 130 2020-12-12 True 162.0 153.0 1.059 NaN
1 433 2020-12-12 False NaN NaN NaN NaN
2 810 2020-12-12 True 153.0 147.0 1.041 NaN
3 2889 2020-12-12 False NaN NaN NaN NaN
4 5410 2020-12-12 False NaN NaN NaN NaN
5 6069 2020-12-12 True 690.0 701.0 0.984 NaN
6 6128 2020-12-12 True 0.0 0.0 1.000 NaN]
into a single DataFrame? I tried DF = pd.DataFrame(df), but that is obviously wrong.
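For reference, a list of DataFrames like the one printed above can be stacked row-wise with `pd.concat`, which accepts the list directly; the `pd.DataFrame` constructor is not meant for that. The following is a minimal sketch, assuming every file shares the same header. `csv_a`, `csv_b`, `frames`, and `combined` are illustrative names, and the rows are cut down from the sample data in this post.

```python
import io

import pandas as pd

# Two small stand-ins for files read from the bucket (same header in both).
csv_a = (
    "Id,date,success,on,off,quota,errors\n"
    "130,2020-12-09,True,0.0,0.0,1.000,\n"
    "433,2020-12-09,False,,,,\n"
)
csv_b = (
    "Id,date,success,on,off,quota,errors\n"
    "130,2020-12-10,True,634.0,556.0,1.140,\n"
    "433,2020-12-10,False,,,,\n"
)

# Same shape as the list built in the loop: one DataFrame per file.
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]

# Stack the frames row-wise; ignore_index=True renumbers the rows 0..n-1
# instead of repeating each file's own 0..k index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Leaving out `ignore_index=True` keeps each file's original row labels, which is why the combined output earlier in this post shows the index restarting at 0 for every date.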
Edit: reproducible data
All CSV files in the bucket have this form:
Id,date,success,on,off,quota,errors
130,2020-12-09,True,0.0,0.0,1.000,
433,2020-12-09,False,,,,
810,2020-12-09,True,0.0,0.0,1.000,
2889,2020-12-09,True,1653.0,1707.0,0.968,
5410,2020-12-09,False,,,,
6069,2020-12-09,True,0.0,0.0,1.000,
6128,2020-12-09,True,2202.0,2182.0,1.009,
Here is a second example:
Id,date,success,on,off,quota,errors
130,2020-12-11,True,366.0,341.0,1.073,
433,2020-12-11,False,,,,
810,2020-12-11,True,204.0,201.0,1.015,
2889,2020-12-11,True,359.0,362.0,0.992,
5410,2020-12-11,False,,,,
6069,2020-12-11,True,1601.0,1588.0,1.008,
6128,2020-12-11,True,703.0,705.0,0.997,
All missing values are left empty.
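As a quick check of what those empty fields turn into, the sketch below parses two rows of the sample data and counts the missing values per column. It assumes the default `read_csv` settings; `sample` and `parsed` are illustrative names.

```python
import io

import pandas as pd

# Two rows from the sample data: one complete, one with empty trailing fields.
sample = (
    "Id,date,success,on,off,quota,errors\n"
    "130,2020-12-09,True,0.0,0.0,1.000,\n"
    "433,2020-12-09,False,,,,\n"
)

parsed = pd.read_csv(io.StringIO(sample))

# Empty fields are read as NaN, so isna() marks them as missing.
print(parsed.isna().sum())
```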
The code below works fine.
import io
import pandas as pd
csv1 ="""Id,date,success,on,off,quota,errors
130,2020-12-09,True,0.0,0.0,1.000,
433,2020-12-09,False,,,,
810,2020-12-09,True,0.0,0.0,1.000,
2889,2020-12-09,True,1653.0,1707.0,0.968,
5410,2020-12-09,False,,,,
6069,2020-12-09,True,0.0,0.0,1.000,
6128,2020-12-09,True,2202.0,2182.0,1.009,"""
csv2="""Id,date,success,on,off,quota,errors
130,2020-12-11,True,366.0,341.0,1.073,
433,2020-12-11,False,,,,
810,2020-12-11,True,204.0,201.0,1.015,
2889,2020-12-11,True,359.0,362.0,0.992,
5410,2020-12-11,False,,,,
6069,2020-12-11,True,1601.0,1588.0,1.008,
6128,2020-12-11,True,703.0,705.0,0.997,"""
# Parse each CSV string and stack the two frames row-wise.
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))
df = pd.concat([df1, df2])
df.to_csv('aws_toy.csv')
print(df, '\n')
# Mean quota per date (NaN values are skipped by mean).
avg_quota = df.groupby('date').agg(avg_quota=('quota', 'mean')).reset_index()
print(avg_quota, '\n')
# Keep only the 'on' column.
select_quota = df.filter(['on'])
print(select_quota, '\n')
Output:
Id date success on off quota errors
0 130 2020-12-09 True 0.0 0.0 1.000 NaN
1 433 2020-12-09 False NaN NaN NaN NaN
2 810 2020-12-09 True 0.0 0.0 1.000 NaN
3 2889 2020-12-09 True 1653.0 1707.0 0.968 NaN
4 5410 2020-12-09 False NaN NaN NaN NaN
5 6069 2020-12-09 True 0.0 0.0 1.000 NaN
6 6128 2020-12-09 True 2202.0 2182.0 1.009 NaN
0 130 2020-12-11 True 366.0 341.0 1.073 NaN
1 433 2020-12-11 False NaN NaN NaN NaN
2 810 2020-12-11 True 204.0 201.0 1.015 NaN
3 2889 2020-12-11 True 359.0 362.0 0.992 NaN
4 5410 2020-12-11 False NaN NaN NaN NaN
5 6069 2020-12-11 True 1601.0 1588.0 1.008 NaN
6 6128 2020-12-11 True 703.0 705.0 0.997 NaN
date avg_quota
0 2020-12-09 0.9954
1 2020-12-11 1.0170
on
0 0.0
1 NaN
2 0.0
3 1653.0
4 NaN
5 0.0
6 2202.0
0 366.0
1 NaN
2 204.0
3 359.0
4 NaN
5 1601.0
6 703.0