Scrapy: how to get all paragraphs that follow a heading



I want to extract the text of all <p> tags that have a heading.

<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<h2>My Second Heading</h2>
<p>My Second paragraph.</p>
<h3>My Third Heading</h3>
<a> There might be something else in middle </a>
<p>My Third paragraph.</p>
<p>My fourth paragraph.</p>
<p>My fifth paragraph.</p>
<p>My sixth paragraph.</p>
</body>
</html>

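I want to extract the text of each <p> tag that comes right after a heading, like this, ignoring the paragraphs that have no heading.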

["My first paragraph", "My second paragraph", "My third paragraph"]

This:

response.xpath("//*[starts-with(name(), 'h')]/following-sibling::p[1]/text()").getall()

will return:

['My first paragraph.', 'My Second paragraph.', 'My Third paragraph.']
