Script that extracts links from pages and checks their domains



I'm trying to write a script that goes through a list of web pages, extracts the links from each page, and checks each link to see whether it belongs to a given set of domains. I've set the script up to write two files: pages whose links are in the given domains go into one file, and the rest go into the other. I'm basically trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right. I'd appreciate any suggestions on how to achieve this (as you can tell, I'm new to this).

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')

There are a few very basic problems with your code:

  • if invalid == None is missing the : at the end, and it should be if invalid is None: anyway
  • not every <a> element has an href, so you need to handle those, or your script will fail
  • the regex has some issues (you probably don't want to repeat the first URL, and the parentheses are pointless; see the short sketch after this list)
  • you write the URL to a file every time you find a problem link, but you only need to write the page once if it has a problem, unless you actually want a complete list of all the problem links
  • you reopen (and thus overwrite) the files on every iteration of the for loop, so you only ever get the results for the last URL
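On the regex point: the dots in the pattern are unescaped, so they match any character, and www.x.com would also match something like wwwaxbcom. Here is a minimal sketch of a tighter check, using placeholder domains rather than anything from your real list:

import re

# escape the dots so they match a literal '.', then join the domains with '|'
domains = ['www.x.com', 'www.y.com']
check_url = re.compile('|'.join(re.escape(d) for d in domains))

print(bool(check_url.search('https://www.x.com/page')))   # True
print(bool(check_url.search('https://wwwaxbcom/page')))   # False once the dots are escaped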

Fixing all of that (and using some arbitrary URLs that actually work):

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
            else:
                f.write(urls[i])
                f.write('\n')

However, there are still quite a few problems:

  • you open file handles but never close them; use with instead (see the short sketch after this list)
  • you loop over the list using an index, which isn't needed; loop over urls directly
  • you compile the regex for efficiency, but then do it on every iteration, cancelling out the benefit
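To make the first point concrete: with closes the file as soon as the block exits, even if an exception is raised inside it. A small sketch, with a placeholder file name:

# the file is guaranteed to be closed when the with-block exits,
# even if a request (or anything else inside it) raises an exception
with open('demo_output.txt', 'w') as out:
    out.write('first line\n')

print(out.closed)  # True: the handle was closed automatically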

The same code with those problems fixed:

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
                else:
                    f.write(url)
                    f.write('\n')

Or, if you want a list of all the problem URLs on each site:

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
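One thing none of the versions above handle is a page that can't be fetched: requests.get can raise (for example on a connection error) or return a non-200 response, in which case grab.text isn't the page you wanted. A minimal sketch of how that could be guarded, assuming you just want to skip unreachable pages; the timeout value is arbitrary:

import requests
from bs4 import BeautifulSoup

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
for url in urls:
    try:
        grab = requests.get(url, timeout=10)
        grab.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as err:
        print(f'skipping {url}: {err}')
        continue
    soup = BeautifulSoup(grab.text, 'html.parser')
    print(url, 'has', len(soup.find_all('a')), 'links')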
