Python XPath: looping through paragraphs and grabbing <strong>



I have a series of paragraphs that I am trying to parse with XPath. The HTML is formatted like this:

<div id="content_third">
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private 
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private 
 <p>
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private 
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private 
 <p>
</div>

I set up an initial loop like this:

titles = tree.xpath('//*[@id="content_third"]/h3')
for num in range(len(titles)):

Then an inner loop:

district_races = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]'.format(num))
for index in range(len(district_races)):

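On each pass, I only want to select the "District" inside the strong. I tried this, and it spits out empty arrays, except for one array filled with all of the districts: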

zone = tree.xpath('//*[@id="content_third"]/p[count(preceding-sibling::h3)={0}]/strong[{1}]/text()'.format(num, index))

Gotta love these unformatted state election web pages.

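I assume each District is a placeholder for an actual name, so getting every District is much simpler than what you are attempting: just extract the text from each strong: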

h = """<div id="content_third">
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private
 <p>
 <h3>Title1</h3>
 <p>
  <strong>District</strong>
  John Q Public <br>
  Susie B Private
 <p>
 <p>
  <strong>District</strong>
  Anna C Public <br>
  Bob J Private
 <p>
</div>"""
from lxml import html
tree = html.fromstring(h)
print(tree.xpath('//*[@id="content_third"]/p/strong/text()'))
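
If the districts also need to stay grouped under their h3 heading, as the nested loops in the question attempt, one possible approach is to iterate over the h3 elements themselves and run a relative XPath from each one. The following is only a minimal sketch under stated assumptions: the sample markup is a well-formed stand-in with made-up titles and district names (`Title2`, `District 1` and so on are hypothetical), and `districts_by_title` is an illustrative helper, not part of lxml. XPath positions and counts are 1-based, which is why the enumeration starts at 1.

```python
from lxml import html

# Hypothetical, well-formed stand-in for the page; titles and district
# names are invented for illustration.
SAMPLE = """<div id="content_third">
 <h3>Title1</h3>
 <p><strong>District 1</strong> John Q Public <br> Susie B Private</p>
 <p><strong>District 2</strong> Anna C Public <br> Bob J Private</p>
 <h3>Title2</h3>
 <p><strong>District 3</strong> John Q Public <br> Susie B Private</p>
</div>"""


def districts_by_title(tree):
    """Return [(title, [(district, [candidate, ...]), ...]), ...]."""
    results = []
    titles = tree.xpath('//*[@id="content_third"]/h3')
    # The first h3 has position 1: every p after it has exactly one
    # preceding h3 sibling, every p after the second h3 has two, etc.
    for position, title in enumerate(titles, start=1):
        paragraphs = title.xpath(
            'following-sibling::p[count(preceding-sibling::h3)=$n]',
            n=position,  # passed to XPath as the variable $n
        )
        races = []
        for p in paragraphs:
            names = p.xpath('strong/text()')
            if not names:
                continue  # skip paragraphs that carry no <strong>
            # Text nodes sitting directly inside the <p> (the candidates).
            candidates = [t.strip() for t in p.xpath('text()') if t.strip()]
            races.append((names[0].strip(), candidates))
        results.append((title.text_content().strip(), races))
    return results


if __name__ == "__main__":
    for title, races in districts_by_title(html.fromstring(SAMPLE)):
        print(title, races)
```

Running the XPath relative to each h3 avoids string formatting of indices into the expression, and skipping paragraphs without a strong keeps stray empty p elements out of the result.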
