How to load a CSV file from a zipped folder at a URL into a Pandas DataFrame



I want to load a CSV file from a zipped folder at a URL into a Pandas DataFrame. I referred to the solution here and used the same approach, as follows:

from urllib import request
import zipfile
import pandas as pd
# link to the zip file
link = 'https://cricsheet.org/downloads/'
# the zip file is named as ipl_csv2.zip
request.urlretrieve(link, 'ipl_csv2.zip')
compressed_file = zipfile.ZipFile('ipl_csv2.zip')
# I need the csv file named all_matches.csv from ipl_csv2.zip
csv_file = compressed_file.open('all_matches.csv')
data = pd.read_csv(csv_file)
data.head()

But after running the code, I got an error:

BadZipFile                                Traceback (most recent call last)
<ipython-input-3-7b7a01259813> in <module>
1 link = 'https://cricsheet.org/downloads/'
2 request.urlretrieve(link, 'ipl_csv2.zip')
----> 3 compressed_file = zipfile.ZipFile('ipl_csv2.zip')
4 csv_file = compressed_file.open('all_matches.csv')
5 data = pd.read_csv(csv_file)
~\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
1267         try:
1268             if mode == 'r':
-> 1269                 self._RealGetContents()
1270             elif mode in ('w', 'x'):
1271                 # set the modified flag so central directory gets written
~\Anaconda3\lib\zipfile.py in _RealGetContents(self)
1334             raise BadZipFile("File is not a zip file")
1335         if not endrec:
-> 1336             raise BadZipFile("File is not a zip file")
1337         if self.debug > 1:
1338             print(endrec)
BadZipFile: File is not a zip file

I am not very familiar with handling zip files in Python. So please help me here: what corrections do I need to make to the code?

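If I open the URL https://cricsheet.org/downloads/ipl_csv2.zip in a web browser, the zip file is downloaded to my system automatically. Since new data is added to this zip file every day, I want to access the URL and get the CSV file directly through Python, to save storage.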
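A likely cause of the BadZipFile error is that the link in the first snippet stops at the downloads/ directory, so urlretrieve saves whatever that page returns (most likely HTML) under the name ipl_csv2.zip. The following is a minimal sketch of how to check whether downloaded bytes are really a zip archive before opening them. The helper name looks_like_zip and both sample payloads are made up for illustration; no real request is sent.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the given bytes form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Made-up stand-in for what a directory URL might return: an HTML page.
html_payload = b"<html><body>Downloads</body></html>"

# A tiny zip archive built in memory, standing in for ipl_csv2.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("all_matches.csv", "match_id,runs\n1,4\n")
zip_payload = buffer.getvalue()

print(looks_like_zip(html_payload))  # False
print(looks_like_zip(zip_payload))   # True
```

Running the same check on the bytes actually downloaded from a URL shows whether the URL points at the archive itself or at some other page.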

Edit 1: If you have any other code solutions, please share...

Here is what I did after the discussion with @nobleknight below:

# importing libraries
import zipfile
from urllib.request import urlopen
import shutil
import os
import pandas as pd

url = 'https://cricsheet.org/downloads/ipl_csv2.zip'
file_name = 'ipl_csv2.zip'
# downloading the zipfile from the URL
with urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
# extracting the required file from the zipfile
with zipfile.ZipFile(file_name) as zf:
    zf.extract('all_matches.csv')
# deleting the zipfile from the directory
os.remove(file_name)
# loading data from the extracted file
data = pd.read_csv('all_matches.csv')

This solution avoids the ContentTooShortError and HTTPForbiddenError errors that I ran into with every other solution I found online. Thanks to @nobleknight for providing part of the solution.

Other ideas are welcome.

Try this:

link = "https://cricsheet.org/downloads/ipl_csv2.zip"

If the file starts downloading, don't worry; cancel the download if you don't want that file. You will always get the updated data from link.
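With the full URL, the archive can also be read without keeping anything on disk. The following is only a sketch, not a tested recipe for this site: the function names read_csv_from_zip_bytes and load_csv_from_zip_url are made up for illustration, and it assumes the server answers a plain urlopen request (it does not handle a 403 response like the HTTPForbiddenError mentioned above).

```python
import io
import zipfile
from urllib.request import urlopen

import pandas as pd


def read_csv_from_zip_bytes(zip_bytes: bytes, member: str) -> pd.DataFrame:
    """Read one CSV member from a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as csv_file:
            return pd.read_csv(csv_file)


def load_csv_from_zip_url(url: str, member: str) -> pd.DataFrame:
    """Download a zip archive into memory and load one CSV member from it."""
    with urlopen(url) as response:
        return read_csv_from_zip_bytes(response.read(), member)
```

Calling load_csv_from_zip_url(link, 'all_matches.csv') with the link above should return the DataFrame directly. Because the archive only ever lives in a BytesIO buffer, there is no zip file to delete afterwards, at the cost of holding the whole download in memory.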
