How to extract dynamic HTML content using Python



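I am trying to extract a product's technical attributes. A product may be electrical, mechanical, or something else. Here is an example of the detail markup for an electrical product, with its technical attributes and values: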

<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>ELECTRICAL RESISTANCE</dt>
<dd>(AAPP) 3.300 MEGOHMS</dd>
<dt>AMBIENT TEMP IN DEG CELSIUS AT FULL RATED POWER</dt>
<dd>(AAQF) 70.0</dd>
<dt>RESISTANCE TOLERANCE IN PERCENT</dt>
<dd>(AAPQ) -5.000/+5.000</dd><dt>POWER DISSIPATION RATING IN WATTS</dt>
<dd>(AEFB) 0.250 FREE AIR</dd><dt>STYLE DESIGNATOR</dt>

<dd>(TEST) 81349-MIL-R-11/8 SPECIFICATION (INCLUDES ENGINEERINGIONS THAT ARE SHOWN AS "TYPICAL", "AVERAGE", "NOMINAL", ETC.).</dd>
</dl>
</div>
</div>
</div>
</div>
</section>

I can extract the electrical attribute keys and values with this Python script:

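import re
from bs4 import BeautifulSoup
productsoup = BeautifulSoup(productdriver.page_source, "lxml")  # productdriver: the Selenium WebDriver holding the page
pattern = re.compile(r'^(ELECTRICAL RESISTANCE|AMBIENT TEMP|RESISTANCE TOLERANCE|POWER DISSIPATION)')  # anchored at the start of the <dt> text
for dt in productsoup.find_all('dt', text=pattern):
    print(dt.get_text(strip=True), dt.find_next_sibling('dd').get_text(strip=True))  # attribute name, then its <dd> value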

But sometimes a mechanical product has this format instead:

<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd><dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd><dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd><dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>

How can I extract the technical attributes (dt) and their corresponding values (dd), whatever the product type?

You can try something like this:

from bs4 import BeautifulSoup
html = """<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
<dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd>
<dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd>
<dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>"""
soup = BeautifulSoup(html, 'html.parser')
dts = soup.find_all("dt")
outs = {i.string: i.find_next("dd").string for i in dts}
print(outs)
#> {'END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965', 'BODY STYLE': '(AAQL) TUBE TYPE', 'CONTINUOUS CURRENT RATING IN AMPS': '(AEBJ) 1.600', 'III END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965'}

Created on 2018-09-28 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-09-28
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.3
#> reprexpy==0.1.1
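
If the same attribute name can appear more than once, or a dt has no matching dd, a dict comprehension may silently drop or mispair entries. The following is only a minimal sketch of a more defensive variant, not part of the original answer: the function name extract_characteristics and the shortened sample markup are made up for illustration, and it assumes the attributes always live in a dl with class "dl-horizontal" as in the question.

```python
from bs4 import BeautifulSoup


def extract_characteristics(html):
    """Return (attribute, value) pairs from every <dl class="dl-horizontal">.

    Hypothetical helper: a list of tuples is used instead of a dict so that
    repeated attribute names are kept rather than overwritten.
    """
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for dl in soup.find_all("dl", class_="dl-horizontal"):
        for dt in dl.find_all("dt"):
            # Look at the next <dt>/<dd> sibling; only pair when it is a <dd>,
            # so a <dt> without a value is skipped instead of borrowing the
            # value that belongs to a later <dt>.
            nxt = dt.find_next_sibling(["dt", "dd"])
            if nxt is None or nxt.name != "dd":
                continue
            # get_text() also works when the tag contains nested markup,
            # where .string would return None.
            pairs.append((dt.get_text(" ", strip=True),
                          nxt.get_text(" ", strip=True)))
    return pairs


# Shortened sample based on the electrical product in the question.
sample = """
<dl class="dl-horizontal">
<dt>ELECTRICAL RESISTANCE</dt>
<dd>(AAPP) 3.300 MEGOHMS</dd>
<dt>RESISTANCE TOLERANCE IN PERCENT</dt>
<dd>(AAPQ) -5.000/+5.000</dd><dt>POWER DISSIPATION RATING IN WATTS</dt>
<dd>(AEFB) 0.250 FREE AIR</dd>
</dl>
"""

for name, value in extract_characteristics(sample):
    print(name, "=>", value)
```

Because nothing here depends on the attribute names, the same call should work for electrical and mechanical products alike; with Selenium you would pass productdriver.page_source instead of the sample string.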
