How do I upload a large file from my local PC to DBFS?



I am trying to learn Spark SQL in Databricks and would like to work with the Yelp dataset; however, the file is too large to upload to DBFS through the UI. Thanks, Philip

There are a few ways to do this:

  1. Use the Databricks CLI's dbfs command to upload the local data to DBFS.
  2. Download the dataset directly from a notebook, e.g. with %sh wget URL, and unpack the archive onto DBFS (either by using /dbfs/path/... as the destination, or by copying the files from the driver node to DBFS with dbutils.fs.cp); see the sketch after this list.
  3. Upload the file to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or something similar, and access the data from there.
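
For option 2, here is a minimal sketch of what the notebook cells could look like. The URL and file names below are placeholders for illustration, not the actual Yelp dataset paths, so substitute your own values:

# Run in a separate cell first to download and unpack the archive on the driver node
# (URL and file names are placeholders):
#   %sh mkdir -p /tmp/yelp && wget <URL> -O /tmp/yelp_dataset.tar && tar -xf /tmp/yelp_dataset.tar -C /tmp/yelp

# Copy an extracted file from the driver's local disk to DBFS
dbutils.fs.cp("file:/tmp/yelp/yelp_academic_dataset_review.json",
              "dbfs:/FileStore/tables/yelp_academic_dataset_review.json")

# Verify that the file is now in DBFS
display(dbutils.fs.ls("dbfs:/FileStore/tables"))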

Upload the file you want to load in Databricks to Google Drive, then download it from there:

from urllib.request import urlopen
from shutil import copyfileobj

my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # DBFS location to which you want to move the downloaded file

# Download the file from Google Drive onto the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)

# Check where the file has been downloaded; in my case it is file:/databricks/driver
display(dbutils.fs.ls('file:/databricks/driver'))

# Move the file from the driver node to the desired DBFS location
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv('file:/databricks/driver/' + my_filename, file_path)
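
Once the file is in DBFS you can read it with Spark and query it with Spark SQL. A small example, again assuming a placeholder file and column name rather than the real Yelp schema:

# Read the uploaded JSON file (placeholder name) into a DataFrame
df = spark.read.json("dbfs:/FileStore/tables/yelp_academic_dataset_review.json")

# Register it as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView("reviews")
spark.sql("SELECT stars, COUNT(*) AS n FROM reviews GROUP BY stars").show()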

I hope this helps.
