How to loop through a list of URLs for web scraping in Python



Very new to Python, and struggling with this loop. I'm trying to pull an HTML attribute, data-address, from a static list of pages that I already have in list form. I've managed to pull the data from a single page using BS4, but I can't loop through my list of URLs correctly. Right now I get this error: Invalid URL '0': No schema supplied. Perhaps you meant http://0? — yet I checked the URLs with single pulls and they all work. Here is my working single-pull code:

import requests
from bs4 import BeautifulSoup
result = requests.get('https://www.coingecko.com/en/coins/0xcharts')
src = result.content
soup = BeautifulSoup(src, 'lxml')
contract_address = soup.find(
    'i', attrs={'data-title': 'Click to copy'})
print(contract_address.attrs['data-address'])

And here is the loop I'm struggling with:

import requests
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/2goshi','https://www.coingecko.com/en/coins/0xcharts']
for link in range(len(url_list)):
    result = requests.get(link)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    contract_address = soup.find(
        'i', attrs={'data-title': 'Click to copy'})
    print(contract_address.attrs['data-address'])
    url_list.seek(0)

Try this:

import requests
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/2goshi','https://www.coingecko.com/en/coins/0xcharts']
for link in url_list:
    result = requests.get(link)
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    contract_address = soup.find(
        'i', attrs={'data-title': 'Click to copy'})
    print(contract_address.attrs['data-address'])
Iterating over the list directly gives you each URL string, so there is no need to index anything. Note that `url_list.seek(0)` was dropped: Python lists have no seek() method (that is a file-object method), and the for loop restarts from the beginning on its own anyway.

You have misunderstood how range() works. Read the documentation.

When you do this:

result = requests.get(link)

link is an int value coming from range() — look at what print(link) shows. Instead, index into the list url_list as follows:

result = requests.get(url_list[link])
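To see what the loop variable actually holds, a quick check (using the same two-element url_list as above, no network access needed):

```python
url_list = ['https://www.coingecko.com/en/coins/2goshi',
            'https://www.coingecko.com/en/coins/0xcharts']

# range(len(url_list)) yields the indices 0 and 1, not the URLs themselves
indices = list(range(len(url_list)))
print(indices)               # [0, 1] — this 0 is what requests.get() received
print(url_list[indices[0]])  # indexing recovers the actual URL string
```

That stray 0 is exactly the "Invalid URL '0'" from the error message: requests.get() was handed an index instead of a URL.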

Here is a complete example:

import requests
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/2goshi','https://www.coingecko.com/en/coins/0xcharts']
for link in range(len(url_list)):
    result = requests.get(url_list[link])
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    contract_address = soup.find(
        'i', attrs={'data-title': 'Click to copy'})
    print(contract_address.attrs['data-address'])

Output:

0x70e132641d6f1bd787b119a289fee544fbb2f316
0x86dd49963fe91f0e5bc95d171ff27ea996c0890c
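One more caveat: soup.find() returns None when no matching tag exists, so contract_address.attrs would raise an AttributeError on any page missing the expected element. A minimal offline sketch of that guard, using a hypothetical HTML snippet that mimics the page structure (not the real site's markup; html.parser is used here to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a fetched page (assumption, not real data)
html = '<i data-title="Click to copy" data-address="0xabc123"></i>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('i', attrs={'data-title': 'Click to copy'})
if tag is not None:
    print(tag.attrs['data-address'])  # 0xabc123
else:
    print('no matching tag on this page')
```

Dropping this check into the loop body lets one malformed page print a warning instead of crashing the whole run.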
