I want to parse all URLs from Git repositories in which an email address appears. I am using https://grep.app
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
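When the code runs, Chrome starts and loads the page with the results. In Chrome's developer tools, in the page source, I can see many a tags with href URLs, such as:
lib/plugins/reverse/lang/eu/lang.php
But my code only returns the "tags" from the footer: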
"[<a href="/"><span class="slashes">//</span>grep.app</a>, <a href="mailto:hello@grep.app">Contact</a>]"
As I understand it, the problem is with parsing content that is rendered by JavaScript. What am I doing wrong?
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
links = []
tags = soup.find_all('a', href=True)
for tag in tags:
    links.append(tag['href'])
print(links)
Output:
['/', 'mailto:hello@grep.app']
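The output above contains only the two footer links, which would be consistent with page_source being read before the JavaScript on the page has rendered the search results. That cause is an assumption, not something confirmed here. As a minimal sketch under that assumption, the snippet below polls an HTML source until more links than the footer's appear. The names HrefCollector, extract_hrefs and wait_for_hrefs are hypothetical helpers, the min_links=3 default is a guess based on the two footer links shown above, and the standard library's html.parser is used in place of BeautifulSoup so the example stays self-contained.

```python
import time
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all non-empty href values of <a> tags in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.hrefs


def wait_for_hrefs(get_html, min_links=3, timeout=10.0, poll=0.5):
    """Poll get_html() until it yields at least min_links hrefs.

    get_html is any zero-argument callable returning the current HTML.
    If the timeout passes first, the hrefs from the last snapshot are
    returned anyway, so the caller can see what was actually found.
    """
    deadline = time.monotonic() + timeout
    while True:
        hrefs = extract_hrefs(get_html())
        if len(hrefs) >= min_links or time.monotonic() >= deadline:
            return hrefs
        time.sleep(poll)
```

With the browser object from the question, this could be called as links = wait_for_hrefs(lambda: browser.page_source) after browser.get(url). Selenium's own WebDriverWait with an expected condition is the more usual way to wait for rendered elements; the polling loop here only illustrates the idea without requiring a browser.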