Parsing the text of certain "html elements" using Selenium

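So far I have seen that once a page's source has been obtained through Selenium, the text or other content can be parsed out of it with BS4 or lxml, whether or not the page relies on JavaScript. My question is how to do the same for one specific HTML element: get its markup through Selenium, then parse it with the bs4 or lxml library. For the element pasted below, this is how I would apply BS4 or lxml: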

html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
        <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
#rest of the code here
from lxml.html import fromstring
tree = fromstring(html)           
#rest of the code here

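Now, how can I get the HTML fragment pasted above through Selenium and then apply the bs4 library to it? I can't see how driver.page_source would help, because it only returns the source of the whole web page.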

To be more specific, if I want to use something like the following, how can it be done?

from selenium import webdriver
driver = webdriver.Chrome()
element_html = driver-------(html)  #this "html" is the above pasted one
print(element_html)

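driver.page_source gives you the complete HTML source of the page at a particular moment. If you have an element instance, however, you can get its outerHTML with the .get_attribute() method: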

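from selenium.webdriver.common.by import By

# find_element_by_id() was removed in recent Selenium 4 releases; find_element(By.ID, ...) replaces it
element = driver.find_element(By.ID, "some_id")
element_html = element.get_attribute("outerHTML")
soup = BeautifulSoup(element_html, "lxml")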

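As for extracting the span element's source from the onmouseover attribute: I would first parse the tr element with BeautifulSoup, read its onmouseover attribute, and then use a regular expression to pull the HTML value out of the Tip() function call. After that, re-parse the span HTML with BeautifulSoup: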

import re
from bs4 import BeautifulSoup
html='''
<tr onmouseover="this.originalstyle=this.style.backgroundColor;this.style.backgroundColor='DodgerBlue';
this.originalcolor=this.style.color;this.style.color='White';Tip('<span Style=Color:Red>License: <BR />20-214767 (Validity: 21/05/2022)<BR />20C-214769 (Validity: 21/05/2022)<BR />21-214768 (Validity: 21/05/2022)</span>');" onmouseout="this.style.backgroundColor=this.originalstyle;this.style.color=this.originalcolor;UnTip();" style="background-color:White;font-family:Times New Roman;font-size:12px;">
        <td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">AAYUSH PHARMA</td><td style="font-size:10px;font-weight:normal;font-style:normal;text-decoration:none;" align="left">PUNE-1ST FLOOR, SR.NO.742/A, DINSHOW APARTMENT,,SWAYAM HOSPITAL AND NURSING HOME, BHAWANI PETH</td><td style="font-weight:normal;font-style:normal;text-decoration:none;" align="center">RH - 3</td><td>swapnil ramakant pawar, BPH, [140514-21/04/2017]</td>
</tr>
'''
soup = BeautifulSoup(html, "lxml")
mouse_over = soup.tr['onmouseover']
# the literal parentheses of the Tip(...) call must be escaped in the pattern
span = re.search(r"Tip\('(.*?)'\)", mouse_over).group(1)
span_soup = BeautifulSoup(span, "lxml")
print(span_soup.get_text())

Prints:

License: 20-214767 (Validity: 21/05/2022)20C-214769 (Validity: 21/05/2022)21-214768 (Validity: 21/05/2022)
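
The printed text runs the three licenses together, because the <BR /> tags contribute no text of their own. If the individual entries are needed, one possible follow-up is to split the string with a second regular expression. The snippet below is only a sketch: the helper name split_licenses and the pattern LICENSE_RE are made up for illustration, and the pattern assumes every entry looks like "NUMBER (Validity: dd/mm/yyyy)", as in the sample above.

```python
import re

# Text exactly as printed by span_soup.get_text() above.
text = ("License: 20-214767 (Validity: 21/05/2022)"
        "20C-214769 (Validity: 21/05/2022)"
        "21-214768 (Validity: 21/05/2022)")

# Assumed entry format: a license number made of letters, digits and hyphens,
# followed by " (Validity: dd/mm/yyyy)". Adjust if the real data differs.
LICENSE_RE = re.compile(r"([\w-]+) \(Validity: (\d{2}/\d{2}/\d{4})\)")


def split_licenses(text):
    """Return a list of (license_number, validity_date) tuples found in text."""
    return LICENSE_RE.findall(text)


licenses = split_licenses(text)
for number, validity in licenses:
    print(number, validity)
```

Alternatively, passing a separator to BeautifulSoup, for example span_soup.get_text(separator="\n"), should keep each entry on its own line without a second regex.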
