How to stream data from a CSV on Google Drive into a tf.data.Dataset - in Colab



Following Colab's docs, I can get a buffer & even a pd.DataFrame from it (the file is just an example)…

# ... authentication
from googleapiclient.discovery import build   # needed for build() below
file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU' # titanic
# loading data
import io
from googleapiclient.http import MediaIoBaseDownload
drive_service = build('drive', 'v3')      # , credentials=creds
request = drive_service.files().get_media(fileId=file_id)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:                           # actually pull the bytes into buf
    status, done = downloader.next_chunk()
buf.seek(0)
import pandas as pd
df = pd.read_csv(buf)
print(df.head())

But I have trouble getting that data to flow into a Dataset correctly - the "but" =>

dataset = tf.data.experimental.make_csv_dataset(csv_file_path, batch_size=100, num_epochs=1)

takes only a "csv_file_path" as its first argument. Is it possible in Colab to get IO from my Google Drive CSV file into a Dataset (for further use in training)? And how can it be done in a memory-saving way?

P.S. I understand that I could probably make the file (on Google Drive) open to everyone & get the URL the simple way:

#TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TRAIN_DATA_URL = "https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1) 
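
(Note: a share link of the /file/d/&lt;id&gt;/view form returns an HTML page, not the CSV itself; for a genuinely public file one would need Drive's direct-download endpoint - a sketch with the same id, assuming a small file that doesn't trigger the virus-scan confirmation page:)

# a sketch: the 'uc' endpoint serves the raw bytes of a *public* file
import tensorflow as tf
file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'
TRAIN_DATA_URL = 'https://drive.google.com/uc?export=download&id=' + file_id
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)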

! But I don't want to share the real file… How do I keep the file confidential & get IO from it (on Google Drive) into a tf.data.Dataset in Colab? (preferably with the shortest code - the real project tested in Colab will have much more code)

The Drive docs only helped me part of the way (link) - as I understand it, working in Colab I work in a separate environment (separate from my PC & my 'net env)… So I tried this (following the link):

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing
link = 'https://drive.google.com/open?id=1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'
fluff, id = link.split('=')
print(id)  # verify that you have everything after '='
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('Filename.csv')
import tensorflow as tf
ds = tf.data.experimental.make_csv_dataset('Filename.csv', batch_size=100, num_epochs=1) 
iterator = ds.as_numpy_iterator()
print(next(iterator))

It works for me. Thanks for the interest in this topic (if anybody tried it).
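
(A possible in-memory variant of the same idea - a sketch, assuming the downloaded object from above, so that no Filename.csv ever touches the disk; it also assumes the header itself contains no quoted commas:)

# a sketch: read the CSV into a string and build the Dataset from its lines
import tensorflow as tf
csv_text = downloaded.GetContentString()   # PyDrive: file content as one str
header, *rows = csv_text.splitlines()
line_ds = tf.data.Dataset.from_tensor_slices(rows)
# all-string defaults parse every column as tf.string; cast later as needed
defaults = [str()] * len(header.split(','))
ds_mem = line_ds.map(lambda x: tf.io.decode_csv(x, record_defaults=defaults))
print(next(iter(ds_mem)))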

Even simpler:

# Load the Drive helper and mount
import tensorflow as tf
from google.colab import drive
drive.mount('/content/drive')
_types = [float(), float(), float(), float(), str()]   # defaults define the column types
_lines = tf.data.TextLineDataset('/content/drive/My Drive/iris.csv')
ds = _lines.skip(1).map(lambda x: tf.io.decode_csv(x, record_defaults=_types))
ds0 = ds.take(2)
print(*ds0.as_numpy_iterator(), sep='\n')   # print the list with sep => row by row
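
(If (features, label) pairs are needed for training, the decoded columns can be split in one more map step - a minimal sketch, assuming the 5-column iris layout above:)

# a sketch: stack the four numeric columns into one feature tensor,
# keep the last column ('variety') as the label
pairs = ds.map(lambda *cols: (tf.stack(cols[:4]), cols[4]))
print(next(iter(pairs)))   # (<tf.Tensor shape=(4,)>, <tf.Tensor dtype=string>)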

From a df (batched to save memory):

import tensorflow as tf
import pandas as pd
import numpy as np
# Load the Drive helper and mount
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/iris.csv', dtype='float32', converters={'variety': str}, nrows=20, decimal='.')
ds = tf.data.Dataset.from_tensor_slices(dict(df))   # dict(df) handles mixed types
ds = ds.shuffle(20, reshuffle_each_iteration=False)   # for the train ds ONLY!
ds = ds.batch(batch_size=4)
ds = ds.prefetch(4)
# labels
label = ds.map(lambda x: x['variety'])
print(list(label.as_numpy_iterator()))
# features
# features = ds.map(lambda x: (x['sepal.length'], x['sepal.width']))
# Or with dynamic keys:
features = ds.map(lambda x: list(map(x.get, list(np.setdiff1d(list(x.keys()), ['variety'])))))
print(list(features.as_numpy_iterator()))

…with any transformations done in map…
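
(To feed a model, the two streams can be recombined - a small sketch, assuming the features / label datasets built above; the reshuffle_each_iteration=False setting is what keeps their order aligned:)

# a sketch: pair the feature batches with the label batches again
train_ds = tf.data.Dataset.zip((features, label))
for f, l in train_ds.take(1):
    print(len(f), 'feature columns; labels:', l.numpy())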
