Script that extracts links from pages and checks their domains



I'm trying to write a script that goes through a list of web pages, extracts the links from each page, and checks each link to see whether it belongs to a given set of domains. I've set the script up to write two files: pages whose links are in the given domains go into one file, and the rest go into the other. I'm basically trying to sort the pages based on the links they contain. Below is my script, but it doesn't look right. I'd appreciate any suggestions on how to achieve this (as you can tell, I'm new to this).

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.rose.com', 'https://www.pink.com']
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')
    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')

There are a few very basic problems with your code:

  • if invalid == None is missing the : at the end, and it should be if invalid is None: anyway
  • not every <a> element has an href, so you need to handle those, or your script will fail
  • the regex has some issues (you probably don't want to repeat the first URL, and the parentheses are pointless; see the short sketch after this list)
  • you write the URL to a file every time you find a problem link, but you only need to write the page once if it has a problem, unless you actually want a complete list of all the problem links
  • you reopen (and thus overwrite) the files on every iteration of the for loop, so you only ever get the results for the last URL
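On the regex point: the dots in the pattern are unescaped, so they match any character, and www.x.com would also match something like wwwaxbcom. Here is a minimal sketch of a tighter check, using placeholder domains rather than anything from your real list:

import re

# escape the dots so they match a literal '.', then join the domains with '|'
domains = ['www.x.com', 'www.y.com']
check_url = re.compile('|'.join(re.escape(d) for d in domains))

print(bool(check_url.search('https://www.x.com/page')))   # True
print(bool(check_url.search('https://wwwaxbcom/page')))   # False once the dots are escaped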

Fixing all of that (and using some arbitrary URLs that actually work):

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')
for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
            else:
                f.write(urls[i])
                f.write('\n')

However, there are still quite a few problems:

  • you open file handles but never close them; use with instead (see the short sketch after this list)
  • you loop over the list using an index, which isn't needed; loop over urls directly
  • you compile the regex for efficiency, but then do it on every iteration, cancelling out the benefit
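To make the first point concrete: with closes the file as soon as the block exits, even if an exception is raised inside it. A small sketch, with a placeholder file name:

# the file is guaranteed to be closed when the with-block exits,
# even if a request (or anything else inside it) raises an exception
with open('demo_output.txt', 'w') as out:
    out.write('first line\n')

print(out.closed)  # True: the handle was closed automatically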

The same code with those problems fixed:

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
                else:
                    f.write(url)
                    f.write('\n')

Or, if you want a list of all the problem URLs on each site:

import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')
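One thing none of the versions above handle is a page that can't be fetched: requests.get can raise (for example on a connection error) or return a non-200 response, in which case grab.text isn't the page you wanted. A minimal sketch of how that could be guarded, assuming you just want to skip unreachable pages; the timeout value is arbitrary:

import requests
from bs4 import BeautifulSoup

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']
for url in urls:
    try:
        grab = requests.get(url, timeout=10)
        grab.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as err:
        print(f'skipping {url}: {err}')
        continue
    soup = BeautifulSoup(grab.text, 'html.parser')
    print(url, 'has', len(soup.find_all('a')), 'links')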
