Importing multiple CSV files from a Google Cloud Bucket into Datalab



I am trying to get the following code to import multiple CSV files (ALLOWANCE1.csv and ALLOWANCE2.csv) from a Google Cloud Bucket into Datalab running Python 2.x:

import numpy as np
import pandas as pd
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
from io import BytesIO
myBucket = storage.Bucket('Bucket Name')
object_list = myBucket.objects(prefix='ALLOWANCE')
df_list = []
for obj in object_list:
  %gcs read --object $obj.uri --variable data  
  df_list.append(pd.read_csv(BytesIO(data)))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

I get the following error at the start of the for loop:

RequestExceptionTraceback (most recent call last)
<ipython-input-5-3188aab389b8> in <module>()
----> 1 for obj in object_list:
      2   get_ipython().magic(u'gcs read --object $obj.uri --variable data')
      3   df_list.append(pd.read_csv(BytesIO(data)))

/usr/local/envs/py2env/lib/python2.7/site-packages/google/datalab/utils/_iterator.pyc in __iter__(self)
     34     """Provides iterator functionality."""
     35     while self._first_page or (self._page_token is not None):
---> 36       items, next_page_token = self._retriever(self._page_token, self._count)
     37 
     38       self._page_token = next_page_token

/usr/local/envs/py2env/lib/python2.7/site-packages/google/datalab/storage/_object.pyc in _retrieve_objects(self, page_token, _)
    319                                          page_token=page_token)
    320     except Exception as e:
--> 321       raise e
    322 
    323     objects = list_info.get('items', [])

RequestException: HTTP request failed: Not Found

I have spent some time trying to fix this but with no luck! Any help would be greatly appreciated!

I don't think you can mix notebook shell commands with Python variables. Maybe try the subprocess Python library instead and invoke the command-line tools from Python:

import numpy as np
import pandas as pd
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as storage
from io import BytesIO
# new imports
from subprocess import call
from google.colab import auth
auth.authenticate_user()

myBucket = storage.Bucket('Bucket Name')
object_list = myBucket.objects(prefix='ALLOWANCE')
df_list = []
for obj in object_list:
    call(['gsutil', 'cp', obj.uri, '/tmp/'])  # first copy the file locally
    filename = obj.uri.split('/')[-1]         # get the file name
    df_list.append(pd.read_csv('/tmp/' + filename))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

Note that I haven't run this exact code, but I have successfully used call() this way with my own files. Another suggestion is to run the file-copy calls in one loop first and then read the files in a second pass; that way, if you iterate over the data a lot, you won't re-download the files every time. A sketch of that two-pass version is shown below.
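
For example, here is a minimal sketch of that two-pass version, under the same assumptions as the code above (the 'Bucket Name' placeholder and the ALLOWANCE prefix); the /tmp/allowance cache directory is a hypothetical choice used only for illustration:

import os
import pandas as pd
import google.datalab.storage as storage
from subprocess import call

cache_dir = '/tmp/allowance'  # hypothetical local cache directory; any writable path works
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

myBucket = storage.Bucket('Bucket Name')  # placeholder bucket name from the question

# Pass 1: copy each matching object once, skipping files that are already cached.
local_paths = []
for obj in myBucket.objects(prefix='ALLOWANCE'):
    filename = obj.uri.split('/')[-1]
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        call(['gsutil', 'cp', obj.uri, local_path])
    local_paths.append(local_path)

# Pass 2: read the cached copies; re-running this loop does not download anything again.
df_list = [pd.read_csv(p) for p in local_paths]
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()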
