How do I change my regular expression so that it applies correctly to the URLs I'm trying to scrape?

I'm using Selenium, and my code is below:

import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
driver = webdriver.Firefox()
omegaBase = "https://www.omegawatches.com/de/"          
productRegex = re.compile(r'[https://](w){3}')
driver.get(omegaBase + "watches/" + "constellation")
links = driver.find_elements_by_tag_name("a")
for link in links:
    pageUrls = link.get_attribute("href")
    print(pageUrls)
    productRegex.findall(pageUrls)

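If I comment out the regex and leave only print(pageUrls), I get every link on the page, which is fine, but I'm trying to select only a few specific links of the form https://www.omegawatches.com/de/watch/name_of_product

I'm not very good with regular expressions and definitely need to practice and learn more, but I've been playing around just to see whether one would apply, and I keep getting the error TypeError: expected string or bytes-like object

Does anyone know how I can fix the regex so that it is at least applied correctly? The regex in the example above is only meant to drop a few links, so that I can see it is at least working.

You don't need a regex for what you're trying to do. You can use a simple CSS selector.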

a[href^='https://www.omegawatches.com/de/watches/']

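This simply looks for an A tag whose href starts with the URL you want.

You can refine it further to target specific links, for example only the watch links in the footer: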

div.footer-main-table a[href^='https://www.omegawatches.com/de/watches/']

and so on.

First, let's look at your regular expression. You have this:
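As a minimal sketch of how the selector above could be used from Selenium: the helper name collect_hrefs is hypothetical, and the code assumes a Selenium 4 style driver whose find_elements(by, value) method accepts the locator strategy as a string. The URL prefix is the one from the selector above.

```python
# CSS selector from the answer: anchors whose href starts with the given prefix.
WATCH_LINKS = "a[href^='https://www.omegawatches.com/de/watches/']"


def collect_hrefs(driver, selector=WATCH_LINKS):
    """Return the href of every element matched by a CSS selector (hypothetical helper)."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", selector)
    return [element.get_attribute("href") for element in elements]


# Intended use with a live browser (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://www.omegawatches.com/de/watches/constellation")
#   for href in collect_hrefs(driver):
#       print(href)
```

Passing a narrower selector, such as the footer one above, as the second argument restricts the result in the same way.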

productRegex = re.compile(r'[https://](w){3}')

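When you build a regular expression, something in square brackets matches any single character from the set it contains. For example, [aeiou] matches only a, e, i, o, or u. Here you want to match the string https://, so write it without the square brackets: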

productRegex = re.compile(r'https://(w){3}')

You can change it further by using ^ to anchor the match at the start of the string, and by simplifying (w){3} to www:

productRegex = re.compile(r'^https://www')
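To make the difference concrete, here is a small sketch comparing the two patterns on plain strings. The ftp and mailto strings are made-up test inputs, not links from the page.

```python
import re

# Original pattern: one character out of the set h, t, p, s, :, / followed by "www".
char_class = re.compile(r'[https://](w){3}')
# Corrected pattern: the literal text "https://www" at the start of the string.
literal = re.compile(r'^https://www')

url = "https://www.omegawatches.com/de/watches/constellation"

# The character class only needs the "/" right before "www", so it also
# accepts a URL that does not start with https:// at all.
print(char_class.findall(url))                      # ['w']
print(char_class.findall("ftp://www.example.com"))  # ['w']

# The literal pattern accepts only URLs that start with https://www.
print(bool(literal.match(url)))                      # True
print(bool(literal.match("ftp://www.example.com")))  # False
```

findall() returns ['w'] rather than the matched text because the pattern contains a capturing group, and a repeated group keeps only its last repetition.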

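Now let's look at how you use the regular expression:

for link in links:
    pageUrls = link.get_attribute("href")
    print(pageUrls)
    productRegex.findall(pageUrls)

Here you use get_attribute() to get the link's URL. This returns a single URL, so I suggest renaming the variable from pageUrls to pageUrl. Then you need to check whether the URL matches the regular expression, which you can do like this: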

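if productRegex.match(pageUrl):
    print(pageUrl)
else:
    print('No match')

(Of course, now that we've come this far, we can see that the ^ in the regular expression isn't needed if we use match(), which only looks for a match at the start of the string.)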
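Putting the pieces together, a sketch of the filtering step could look like the following. Two things here are assumptions rather than facts from the thread: the pattern targets the /de/watch/ product format mentioned in the question, and the TypeError is assumed to come from get_attribute("href") returning None for an anchor without an href, since the re functions raise exactly that error when given None. The function name filter_product_urls is hypothetical.

```python
import re

# Assumed product URL prefix, based on the format described in the question.
# No ^ is needed because match() only matches at the start of the string.
PRODUCT_RE = re.compile(r'https://www\.omegawatches\.com/de/watch/')


def filter_product_urls(hrefs):
    """Keep only the hrefs that look like product pages (hypothetical helper)."""
    matches = []
    for href in hrefs:
        # Skip missing hrefs: passing None to match() raises
        # "TypeError: expected string or bytes-like object".
        if href is None:
            continue
        if PRODUCT_RE.match(href):
            matches.append(href)
    return matches


# Made-up sample of values as they might come back from get_attribute("href").
sample = [
    "https://www.omegawatches.com/de/watch/name_of_product",
    "https://www.omegawatches.com/de/watches/constellation",
    None,
]
print(filter_product_urls(sample))
```

With Selenium, the input list would be built from the loop above, for example [link.get_attribute("href") for link in links].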
