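I'm trying to scrape a website, but I can't work out how to finish the code so that it accepts several URLs at once. At the moment the code only works with one URL at a time.
The current code is: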
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
try:
    html = urlopen("http://google.com")
except HTTPError as e:
    print(e)
except URLError:
    print("error")
else:
    res = BeautifulSoup(html.read(),"html5lib")
    tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
    title = res.title.text
    print(title)
    for tag in tags:
        print(tag)
Can someone help me modify it so that I can pass in something like this?
html = urlopen ("url1, url2, url3")
Wrap the repeatable part of the code in a function and use a list:
def urlhelper(x):
    # x is a list of URL strings; the body runs once per URL
    for ele in x:
        try:
            html = urlopen(ele)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            # only reached when the request succeeded
            res = BeautifulSoup(html.read(),"html5lib")
            tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
            title = res.title.text
            print(title)
            for tag in tags:
                print(tag)
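Call this function with urlhelper(["url1", "url2", "etc"]).
The key concept to understand here is the "for" statement, which tells Python to iterate over each element of the list.
I recommend reading up on iterators and lists to learn more: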
https://www.w3schools.com/python/python_lists.asp
https://www.w3schools.com/python/python_iterators.asp
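To try the loop on its own, here is a minimal sketch that collects the results instead of printing them. The name `fetch_all` and its `opener` parameter are my own additions, not part of the code above. `opener` defaults to `urlopen` and is only there so the loop can be exercised with a stand-in function and no network access. I also catch `ValueError`, which `urlopen` raises for strings without a scheme, such as the placeholder "url1".

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_all(urls, opener=urlopen):
    """Return a dict mapping each URL to its raw bytes, or None on failure.

    `opener` is a hypothetical hook for testing; it defaults to urlopen.
    """
    pages = {}
    for url in urls:  # the body runs once per element of the list
        try:
            response = opener(url)
        except (HTTPError, URLError, ValueError) as e:
            # one bad URL is reported and skipped; the loop carries on
            print(f"{url}: {e}")
            pages[url] = None
        else:
            pages[url] = response.read()
    return pages
```

Each value that is not None can then be passed to BeautifulSoup exactly as in the function above.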
You can create a list of URLs and loop over it with a for loop like this:
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
urlList = ["url1", "url2", "url3", "url4"]
for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        res = BeautifulSoup(html.read(),"html5lib")
        tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
        title = res.title.text
        print(title)
        for tag in tags:
            print(tag)
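If the URLs really do arrive as one comma-separated string, as in the question's urlopen("url1, url2, url3"), they would have to be split into a list first, because urlopen takes a single URL per call. A minimal sketch, where `split_urls` is a hypothetical helper name of my own:

```python
def split_urls(text):
    """Split a comma-separated string into a list of stripped, non-empty URLs."""
    return [part.strip() for part in text.split(",") if part.strip()]


urlList = split_urls("url1, url2, url3")
print(urlList)  # ['url1', 'url2', 'url3']
```

The resulting urlList can be fed to the for loop above unchanged.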