I'm trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Philipp
There are several approaches:
- Upload local data to DBFS with the Databricks CLI's dbfs command
- Download the dataset directly from a notebook, e.g. with %sh wget URL, and unpack the archive onto DBFS (either by writing to /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy the file from the driver node to DBFS); a sketch of this option follows the list
- Upload the file to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or something similar, and access the data from there
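As an illustration of the second option, here is a minimal sketch of downloading a file inside a notebook and then copying it from the driver node to DBFS; the URL and file names are placeholders, not part of the original answer. (For the first option, the legacy Databricks CLI upload is a one-liner along the lines of dbfs cp my_local_file dbfs:/FileStore/tables/.)

from urllib.request import urlretrieve

# Download the archive to the driver node's local disk (placeholder URL and path)
urlretrieve('https://example.com/yelp_dataset.tgz', '/tmp/yelp_dataset.tgz')

# Copy it from the driver's local filesystem into DBFS
dbutils.fs.cp('file:/tmp/yelp_dataset.tgz', 'dbfs:/FileStore/tables/yelp_dataset.tgz')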
Upload the file you want to load in Databricks to Google Drive, then pull it down with the snippet below.
from urllib.request import urlopen
from shutil import copyfileobj

my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # DBFS location you want to move the downloaded file to

# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)

# Check where the file was downloaded; in my case it is the driver's working directory
display(dbutils.fs.ls('file:/databricks/driver'))

# Move the file to the desired location
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv('file:/databricks/driver/' + my_filename, file_path)
I hope this helps.