Scrapy XPath: selecting following-sibling elements between two h2 tags



I have a poorly designed HTML page that I am trying to extract data from with Scrapy. Here is the fragment I am interested in:

<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>

Specifically, I want to know the number of schools and the number of departments in each school. I got the list of all schools like this:

>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']

Then, for each school, I looked up the departments listed under it like this:

>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').extract())
...     print ("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------

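This clearly does not work as intended, because following-sibling selects every ul that comes after the h2, not only the ones between two h2 tags. How can I achieve that?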

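One technique is to pick a common divider element that marks the start of each new block of information (here, the h2), use count() on the preceding-sibling axis to work out its position among the dividers, and then select every data element (here, the ul) that has exactly that many dividers among its preceding siblings: the dividers before the chosen one, plus one for the chosen divider itself.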

In an IPython shell:

In [1]: from lxml import etree
In [2]: string = '''<html>
...:     <h2 class="schoolName">Graduate School of Business</h2>
...:         <ul title="Graduate School of Business departments - part 1"></ul>
...:         <ul title="Graduate School of Business departments - part 2"></ul>
...:         <ul title="Graduate School of Business departments - part 3"></ul>
...:    <h2 class="schoolName">School of Law</h2>
...:        <ul title="School of Law departments - part 1"></ul>
...:        <ul title="School of Law departments - part 2"></ul>
...:   <h2 class="schoolName">School of Medicine</h2>
...:       <ul title="School of Medicine departments - part 1"></ul>
...: </html>'''
In [3]: root = etree.fromstring(string)
In [4]: schools = root.xpath('//h2[@class="schoolName"]/text()')
In [5]: schools
Out[5]: ['Graduate School of Business', 'School of Law', 'School of Medicine']
In [6]: for school in schools:
...:     print (school)
...:     position = int(root.xpath(f'count(//h2[text()="{school}"]/preceding-sibling::h2) + 1'))
...:     print (f"Position: {position}")
...:     print (root.xpath(f'//ul[count(preceding-sibling::h2) = {position}]/@title'))
...: 
Graduate School of Business
Position: 1
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3']
School of Law
Position: 2
['School of Law departments - part 1', 'School of Law departments - part 2']
School of Medicine
Position: 3
['School of Medicine departments - part 1']
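
The same idea can be sketched without looking each heading up by its text, which avoids quoting problems when a school name contains a double quote. The version below iterates over the h2 elements themselves, uses XPath expressions relative to each one, and passes the position in as an XPath variable (`$pos`) through lxml's keyword arguments. The helper name `departments_by_school` and the variable name `pos` are made up for this illustration. parsel, which backs Scrapy's `response.xpath()`, also accepts XPath variables as keyword arguments, so the approach should carry over to Scrapy, though that is not tested here.

```python
from lxml import etree

HTML = """<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>"""


def departments_by_school(root):
    """Map each school name to the titles of the <ul> siblings in its block.

    Hypothetical helper for illustration; assumes every <h2> and <ul>
    shares the same parent, as in the sample markup.
    """
    result = {}
    for h2 in root.xpath('//h2[@class="schoolName"]'):
        # Number of h2 dividers up to and including this one.
        position = int(h2.xpath('count(preceding-sibling::h2)')) + 1
        # Keep only the following <ul> siblings that are preceded by
        # exactly that many h2 elements, i.e. those in this h2's block.
        titles = h2.xpath(
            'following-sibling::ul[count(preceding-sibling::h2) = $pos]/@title',
            pos=position,
        )
        result[h2.text] = [str(title) for title in titles]
    return result


root = etree.fromstring(HTML)
grouped = departments_by_school(root)
for school, titles in grouped.items():
    # Print the school name and how many department lists it has.
    print(school, len(titles))
```

For the sample markup this gives 3, 2 and 1 department lists for the three schools, and `len(grouped)` gives the number of schools, which are the two counts the question asks for.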
