Error downloading files with BeautifulSoup



I am trying to download some files from a free dataset using BeautifulSoup. I repeat the same procedure for two similar links on the web page.

Here is the page address.

import requests
from bs4 import BeautifulSoup

first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"
# labeled as Connectivity Matrix File on the webpage

def download_file(url, file_name):
    myfile = requests.get(url)
    with open(file_name, 'wb') as f:
        f.write(myfile.content)

download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")

Output files:

file1.txt:
50.118248 53.451775 39.279296 
51.417612 67.443649 41.009074 
...
file2.txt:
<html><body><h1>Internal error</h1>Ticket issued: <a href="/admin/default/ticket/umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562" target="_blank">umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</a></body><!-- this is junk text else IE does not display the page: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx //--></html>

However, I can download second_url correctly from the Chrome browser (the file contains some numbers). I tried setting a user agent:

headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)

but it did not work.
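Since the error page can come back as an ordinary response, one way to avoid silently writing it to disk is to check the HTTP status and inspect the payload before saving. This is a sketch, not a fix for the server-side error itself; the helper name `looks_like_html_error` is mine, and the heuristic simply relies on the fact that the data files here start with numbers while the error page starts with an `<html>` tag:

```python
import requests

def looks_like_html_error(content: bytes) -> bool:
    # The data files in this dataset begin with numeric columns,
    # while the server's error page begins with an <html> tag.
    return content.lstrip().lower().startswith(b"<html")

def download_file(url, file_name):
    response = requests.get(url)
    response.raise_for_status()  # raises on 4xx/5xx status codes
    if looks_like_html_error(response.content):
        raise RuntimeError("Server returned an HTML error page for " + url)
    with open(file_name, "wb") as f:
        f.write(response.content)
```

This at least turns a corrupted download into a visible exception.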

Edit: the site does not require a login to get the data. I opened the page in a private-mode browser window and was able to download the file from second_url there. Pasting second_url directly into the address bar gives this error:

Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76

Any ideas? Thanks in advance for your guidance.

This is not a Python problem. The second URL gives the same error in curl and in my browser.
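If the internal error is intermittent on the server side, a retry policy can at least rule out transient failures. A minimal sketch using requests' built-in Retry support (the retry counts and status codes below are illustrative, and this will not help if the URL is permanently broken on the server):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times with exponential backoff on typical
# transient server errors (values here are illustrative).
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

# usage (performs a network call):
# r = session.get(second_url)
# r.raise_for_status()
```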

Incidentally, the second URL is shorter, which strikes me as odd. Are you sure you copied it correctly?
