Parsing URLs from a JavaScript-driven page with BeautifulSoup and Selenium

I want to collect the URLs of all Git repository files in which email addresses appear. I am using https://grep.app for the search.

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)

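When the code runs, Chrome starts and loads the page with the results. In Chrome's developer tools, in the page source, I can see many a elements whose href attributes hold the result URLs.

For example: lib/plugins/reverse/lang/eu/lang.php

But my code only returns the "tags" from the footer: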

"[<a href="/"><span class="slashes">//</span>grep.app</a>, <a href="mailto:hello@grep.app">Contact</a>]"

As I understand it, the problem is with parsing the JavaScript-rendered content. What am I doing wrong?

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
links = []
tags = soup.find_all('a', href=True)
for tag in tags:
    links.append(tag['href'])

print(links)

Output:

['/', 'mailto:hello@grep.app']
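
One likely explanation for the output above is that browser.page_source is read right after browser.get(url) returns, before the page's JavaScript has inserted the result links, so only the static footer is present. The following is a minimal sketch of an explicit wait, not a confirmed fix. It assumes that the result links are ordinary a elements with an href attribute, that the footer contributes exactly the two links shown above, and that the installed Selenium version can locate chromedriver on its own (otherwise a driver path has to be supplied). The helper names wait_for_result_links, keep_result_links and collect_links are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: the footer alone contributes the two anchors seen in the output above.
FOOTER_LINK_COUNT = 2


def wait_for_result_links(driver, timeout=15):
    """Block until the page holds more <a href> elements than the footer alone."""
    # Imported lazily so keep_result_links stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, timeout).until(
        lambda d: len(d.find_elements(By.CSS_SELECTOR, "a[href]")) > FOOTER_LINK_COUNT
    )


def keep_result_links(hrefs, base_url):
    """Drop the footer links and turn the remaining hrefs into absolute URLs."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        if href != "/" and not href.startswith("mailto:")
    ]


def collect_links(url):
    """Load url in Chrome, wait for rendered links, and return them as a list."""
    from bs4 import BeautifulSoup
    from selenium import webdriver

    browser = webdriver.Chrome()  # assumption: Selenium finds chromedriver itself
    try:
        browser.get(url)
        wait_for_result_links(browser)
        soup = BeautifulSoup(browser.page_source, "lxml")
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        return keep_result_links(hrefs, url)
    finally:
        browser.quit()
```

Calling collect_links('https://grep.app/search?current=100&q=%40gmail.com') would then return whatever links are present once the wait condition is met. If no extra links appear within the timeout, WebDriverWait raises a TimeoutException, which would suggest the results are rendered some other way (for example inside elements that are not a tags) and the selector needs adjusting.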
