I'm trying to write a script that loops through a list of web pages, extracts the links from each page, and checks each link to see whether it belongs to a given set of domains. I've set the script up to write two files: pages whose links are in the given domains go into one file, and the rest go into another. I'm essentially trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right. I'd appreciate any suggestions on how to accomplish this (can you tell I'm new at this?).
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')
There are a few pretty basic problems with your code:
- `if invalid == None` is missing the `:` at the end, and it should be `if invalid is None:` anyway
- not all `<a>` elements have an `href`, so you need to deal with those, or your script will fail
- the regex has some issues (you probably don't want to repeat the first URL, and the parentheses are pointless)
- you write the URL to a file every time you find a problem link, but you only need to write it once if the page has a problem at all; or perhaps you wanted a complete list of all the offending links
- you reopen the files in write mode on every iteration of the `for` loop, which truncates them, so you only keep the final result
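To illustrate the first point, here is a quick sketch (with a hypothetical class, not from the code above) of why `is None` is preferred over `== None`: `==` can be overridden, while identity cannot.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""
    def __eq__(self, other):
        return True

obj = AlwaysEqual()
print(obj == None)  # True: __eq__ can lie
print(obj is None)  # False: identity checks can't be fooled
```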
Fixing all of that (and using some arbitrary URLs that actually work):
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
            else:
                f.write(urls[i])
                f.write('\n')
However, there are still quite a few problems:
- you open file handles but never close them; use `with` instead
- you loop over the list using an index, which isn't needed; loop over `urls` directly
- you compile the regex for efficiency, but you do it on every iteration, which defeats the purpose
The same code with those issues fixed:
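A quick illustration of the `with` point: the handle is closed automatically as soon as the block exits, even if an exception is raised inside it (the file name here is just for demonstration).

```python
# `with` guarantees the file is closed when the block exits
with open('demo.txt', 'w') as fh:
    fh.write('hello\n')
print(fh.closed)  # True
```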
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
                else:
                    f.write(url)
                    f.write('\n')
Or, if you want a list of all the offending URLs on the sites:
import requests
from bs4 import BeautifulSoup
import re
urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
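One last caveat about the substring regex: it also matches hosts like gamespot.com.evil.example. A more robust check (a sketch, not part of the code above; the allowed-host set is an assumption) parses each link and compares the actual hostname:

```python
from urllib.parse import urlparse

# hypothetical set of hosts we consider "good"
ALLOWED = {'www.gamespot.com', 'gamespot.com', 'www.pcgamer.com', 'pcgamer.com'}

def is_allowed(href):
    # relative links have no hostname; treat them as belonging to the page's own site
    host = urlparse(href).hostname
    return host is None or host in ALLOWED

print(is_allowed('https://www.gamespot.com/games/'))      # True
print(is_allowed('https://gamespot.com.evil.example/x'))  # False
print(is_allowed('/reviews/latest'))                      # True
```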