Unable to scrape an email address from a webpage using the requests module



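I'm trying to scrape the email address from this webpage using the requests module rather than selenium. The email address is obfuscated and does not appear in the page source; instead, a JavaScript function generates it. How can I use the following snippet to recover the email address shown on that page?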

document.write("\u003cn uers=\"znvygb:gnneba@zbsb.pbz\"\u003egnneba@zbsb.pbz\u003c/n\u003e".replace(/[a-zA-Z]/g, function(c){return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));

What I've tried so far:

import requests
from bs4 import BeautifulSoup
link = 'https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
email = soup.select_one("dt:-soup-contains('Email') + dd")
print(email)

Expected output:

taaron@mofo.com

For tasks like this, I recommend the js2py module:

import js2py
import requests
from bs4 import BeautifulSoup
link = "https://www.californiatoplawyers.com/lawyer/311805/tobyn-yael-aaron"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
email = soup.select_one("dt:-soup-contains('Email') + dd")
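# The <dd> contains a <script> whose body is document.write(<expression>);
# dropping the call name leaves a parenthesized expression js2py can evaluate.
js_code = email.script.contents[0].replace("document.write", "")
# The expression evaluates to an <a href="mailto:..."> tag; keep only its text.
email = BeautifulSoup(js2py.eval_js(js_code), "html.parser").text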
print(email)

Prints:

taaron@mofo.com
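
If you would rather not evaluate JavaScript at all, note that the replace callback in that snippet shifts every letter by 13 places (ROT13), which Python's built-in rot_13 codec can undo. Below is a minimal sketch using only the standard library. The helper name decode_rot13_email and the regular expressions are my own choices rather than anything from the site, and it assumes the script holds a single double-quoted string literal whose escapes are JSON-compatible, as in the snippet from the question.

```python
import codecs
import json
import re


def decode_rot13_email(script_text: str) -> str:
    """Return the visible text of the ROT13-obfuscated anchor in script_text."""
    # Find the first double-quoted string literal, allowing escaped characters.
    match = re.search(r'"(?:[^"\\]|\\.)*"', script_text)
    if match is None:
        raise ValueError("no string literal found in script")
    # Assumption: the literal is JSON-compatible, so json.loads can resolve
    # escapes such as \u003c and \" for us.
    obfuscated = json.loads(match.group(0))
    # ROT13 only rotates ASCII letters; digits, '@', '.' and '<' stay as-is.
    html = codecs.decode(obfuscated, "rot_13")
    # Strip the surrounding <a ...> and </a> tags, leaving the link text.
    return re.sub(r"<[^>]+>", "", html).strip()


# The script body from the question, kept verbatim in a raw string.
sample = r'''document.write("\u003cn uers=\"znvygb:gnneba@zbsb.pbz\"\u003egnneba@zbsb.pbz\u003c/n\u003e".replace(/[a-zA-Z]/g, function(c){return String.fromCharCode((c<="Z"?90:122)>=(c=c.charCodeAt(0)+13)?c:c-26);}));'''

print(decode_rot13_email(sample))
```

On the live page you could pass email.script.contents[0] from the code above to this helper instead of the sample string. This only works as long as the site keeps using ROT13; if the obfuscation changes, evaluating the script with js2py is the more general route.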
