Scraping multiple URLs with BeautifulSoup

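I'm trying to scrape a website, but I can't work out how to finish the code so that I can pass in several URLs at once. At the moment the code only works with one URL at a time.

The current code is: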

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://google.com")
except HTTPError as e:
    print(e)
except URLError:
    print("error")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
    title = res.title.text
    print(title)
    for tag in tags:
        print(tag)

Could someone help me modify it so that I can pass in something like this?

html = urlopen("url1, url2, url3")

Wrap the repeatable part of the code in a function and pass it a list:

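from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def urlhelper(x):
    # x is a list of URL strings; each one is fetched and parsed in turn
    for ele in x:
        try:
            html = urlopen(ele)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            res = BeautifulSoup(html.read(), "html5lib")
            tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
            title = res.title.text
            print(title)
            for tag in tags:
                print(tag)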

Call the function with urlhelper(["url1", "url2", "etc"]).
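If the URLs start out as one comma-separated string, as in the question, they would need to be split into a list before being passed to the function. A minimal sketch, assuming the separator is a comma and that the names raw and url_list are only illustrative:

```python
# Placeholder string in the shape shown in the question
raw = "url1, url2, url3"

# Split on commas, trim surrounding whitespace, and drop empty pieces
url_list = [part.strip() for part in raw.split(",") if part.strip()]

print(url_list)
```

The resulting list can then be handed to the function in place of a hand-written one.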

The key concept to understand here is the "for" statement: it tells Python to iterate over each element of the list.

I suggest reading up on iterators and lists to learn more:

https://www.w3schools.com/python/python_lists.asp

https://www.w3schools.com/python/python_iterators.asp
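To make the iteration idea concrete, here is a rough sketch of what a for loop does with a list, written out by hand with the built-in iter() and next(). The URL strings are placeholders and the variable names are only illustrative:

```python
urls = ["url1", "url2", "url3"]  # placeholder strings, not real addresses

# The usual way: let the for statement drive the iteration
visited_by_for = []
for url in urls:
    visited_by_for.append(url)

# Roughly the same thing spelled out: ask the list for an iterator,
# then keep calling next() until it signals that it is exhausted
visited_by_hand = []
iterator = iter(urls)
while True:
    try:
        url = next(iterator)
    except StopIteration:
        break
    visited_by_hand.append(url)

print(visited_by_for == visited_by_hand)
```

Both loops visit the same elements in the same order, which is why the body of the function runs once per URL.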

You can create a list of URLs and loop over it with a for loop, like this:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# Replace the placeholders with full URLs, including the http:// or https:// scheme
urlList = ["url1", "url2", "url3", "url4"]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        res = BeautifulSoup(html.read(), "html5lib")
        tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
        title = res.title.text
        print(title)
        for tag in tags:
            print(tag)
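One possible variation, not part of the answer itself: keep the downloaded HTML and the failures in dictionaries keyed by URL, so a failing URL is reported by name instead of a bare "error". This is only a sketch. fetch_all, its opener parameter, and fake_opener are hypothetical names, and the .example addresses are made up so the code can run without a network. It also catches ValueError, which urlopen raises for a string with no scheme such as "url1".

```python
import io
from urllib.error import URLError
from urllib.request import urlopen


def fetch_all(urls, opener=urlopen):
    """Download each URL and return (pages, errors), both keyed by URL."""
    pages = {}   # url -> raw HTML bytes
    errors = {}  # url -> text of the exception that was raised
    for url in urls:
        try:
            # HTTPError is a subclass of URLError, so one clause covers both.
            # ValueError covers strings that are not valid URLs at all.
            with opener(url) as response:
                pages[url] = response.read()
        except (URLError, ValueError) as exc:
            errors[url] = str(exc)
    return pages, errors


def fake_opener(url):
    """Hypothetical stand-in for urlopen so the sketch runs offline."""
    if url == "http://bad.example":
        raise URLError("no such host")
    return io.BytesIO(b"<html><title>" + url.encode() + b"</title></html>")


pages, errors = fetch_all(
    ["http://a.example", "http://bad.example"], opener=fake_opener
)
print(sorted(pages), sorted(errors))
```

With the default opener, each value in pages could then be passed to BeautifulSoup(..., "html5lib") in the same way as html.read() in the loop above.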
