How to read a link from BeautifulSoup output in Python

I am trying to pass along a link that I extracted with BeautifulSoup.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
# collect every absolute href/src value found on the page
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
print(links[1])

This prints the link I want.

Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip

Now I am trying to pass this link along so that I can download the contents.


import io
import os
import zipfile

import requests

# make a folder if it doesn't already exist
# (folder_name is assumed to be defined earlier)
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream = True)
# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for elem in zf.namelist():
        zf.extract(elem, '../data')

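My overall goal is to take the link I scraped and put it in the url variable, because the link on this site keeps changing. I want this to be dynamic so that I don't have to search for the link by hand and update it every time it changes. I hope this makes sense, and I appreciate any help I can get.

If I enter the link manually in my code like below, I know it works: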

url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'

If I can get my code to pass the link in correctly, I know it will work. I am just stuck on how to accomplish it.

I think you can do this with the find_all() method in Beautiful Soup:

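from bs4 import BeautifulSoup

# response is the requests response for the page that lists the links
soup = BeautifulSoup(response.content, 'lxml')
for a in soup.find_all('a'):
    # each pass overwrites url, so it ends up holding the last link's href
    url = a.get('href')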
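
Since the loop above leaves url set to whatever the last link on the page is, and links[1] depends on the page layout, it may be more robust to pick the link by what it looks like. Below is a minimal sketch of that idea, not a definitive solution. It assumes the file you want is the first link on the page whose path ends in .zip. The names pick_zip_link, extract_zip_bytes and download_latest are hypothetical helpers made up for this example, and html.parser is used only to avoid an extra dependency.

```python
import io
import zipfile
from urllib.parse import urljoin, urlparse


def pick_zip_link(links):
    """Return the first link whose URL path ends in .zip, or None."""
    for link in links:
        if urlparse(link).path.lower().endswith(".zip"):
            return link
    return None


def extract_zip_bytes(data, dest):
    """Extract an in-memory zip archive (bytes) into the dest folder."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)  # creates dest if it does not exist
        return zf.namelist()


def download_latest(page_url, dest):
    """Scrape page_url, find the zip link and extract it into dest."""
    # Third-party imports live here so the helpers above work without them.
    import requests
    from bs4 import BeautifulSoup

    page = requests.get(page_url)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    # urljoin turns relative hrefs into absolute URLs
    links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]

    # Assumption: the first .zip link on the page is the wanted file.
    url = pick_zip_link(links)
    if url is None:
        raise ValueError("no .zip link found on " + page_url)

    response = requests.get(url)
    response.raise_for_status()
    return extract_zip_bytes(response.content, dest)
```

You would then call something like download_latest('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources', '../data'). The scraped link goes straight into requests.get(), so nothing has to be pasted in by hand when the file name changes. If the page ever lists more than one zip file, the filter in pick_zip_link would need to be tightened, for example by also checking for a substring of the file name.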
