How can I use Python 3 and Beautiful Soup to get the HTML between two comments?

I am trying to use Beautiful Soup 4 for a scraping project, but I only want to parse the HTML between two specific comments, similar to this:

from bs4 import BeautifulSoup as bsoup
from bs4 import Comment    
html = """
<!-- Comment 1 -->
<p>
<a href="http://www.something.htm"><h4>Link</h4></a>
    Address: 123 1st St., NYC 10001<br />
    Schools:<br />
    School Name 1<br />
    School Name 2<br />
    School Name 3<br />
</p>
<p>
    <a href="http://www.somethingelse.htm"><h4>Link</h4></a>
    Address: 456 2st St., NYC 10001<br />
    Schools:<br />
    School Name 4<br />
    School Name 5<br />
    School Name 6<br />
</p>
<!-- Comment 2 -->
"""

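The plan is to build a list of all the <p> tags between these comments (ignoring any other comments), then iterate over the contents of each one to extract the <a> link, the address, and the school names. But first, I am just trying to figure out how to limit my list of <p> tags to the ones between these comments.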

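The closest thing I have found is a related question, which differs because it only extracts the first element after a specific comment, but I think I will end up using the Comment class somehow.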

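You can do this by reading the list of children of the body element.

Here is a start on the code.

I added <html><body> to the beginning of your HTML (and the matching closing tags at the end) so that Beautiful Soup places the rest of your HTML inside that body element rather than inside wrappers it generates itself before the final parse.
First I find the body element, then I go through each child of that element and print its ordinal position, its type, and its representation. The output follows the code.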

HTML = '''
<html><body>
<!-- Comment 1 -->
<p>
<a href="http://www.something.htm"><h4>Link</h4></a>
    Address: 123 1st St., NYC 10001<br />
    Schools:<br />
    School Name 1<br />
    School Name 2<br />
    School Name 3<br />
</p>
<p>
    <a href="http://www.somethingelse.htm"><h4>Link</h4></a>
    Address: 456 2st St., NYC 10001<br />
    Schools:<br />
    School Name 4<br />
    School Name 5<br />
    School Name 6<br />
</p>
<!-- Comment 2 -->
</body></html>'''
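import bs4

# Parse with the lxml parser (the lxml package must be installed).
soup = bs4.BeautifulSoup(HTML, 'lxml')
# Print each direct child of <body>: its position, its type, and its content.
for c, child in enumerate(soup.find('body').children):
    print(c, type(child), '\n', child)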

Output:

0 <class 'bs4.element.NavigableString'> 

1 <class 'bs4.element.Comment'> 
  Comment 1 
2 <class 'bs4.element.NavigableString'> 

3 <class 'bs4.element.Tag'> 
 <p>
<a href="http://www.something.htm"><h4>Link</h4></a>
    Address: 123 1st St., NYC 10001<br/>
    Schools:<br/>
    School Name 1<br/>
    School Name 2<br/>
    School Name 3<br/>
</p>
4 <class 'bs4.element.NavigableString'> 

5 <class 'bs4.element.Tag'> 
 <p>
<a href="http://www.somethingelse.htm"><h4>Link</h4></a>
    Address: 456 2st St., NYC 10001<br/>
    Schools:<br/>
    School Name 4<br/>
    School Name 5<br/>
    School Name 6<br/>
</p>
6 <class 'bs4.element.NavigableString'> 

7 <class 'bs4.element.Comment'> 
  Comment 2 
8 <class 'bs4.element.NavigableString'>

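Inside that loop you would now use if (or whatever method you prefer) to decide which type of element you are dealing with, and act accordingly. For example, an occurrence of <class 'bs4.element.Comment'> with the content "Comment 1" would mean "start processing", and one with "Comment 2" would mean "stop processing". An occurrence of <class 'bs4.element.Tag'> whose content is a <p> tells you that you need to descend one level and look at the children of that p tag. And so on.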
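The start/stop idea described above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the helper name `paragraphs_between` is made up for this example, the HTML is shortened, an extra <p> after the second comment is added to show that it is skipped, and `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Shortened sample; the trailing <p> is an addition for this sketch.
HTML = """<html><body>
<!-- Comment 1 -->
<p><a href="http://www.something.htm"><h4>Link</h4></a>
    Address: 123 1st St., NYC 10001<br />
</p>
<p><a href="http://www.somethingelse.htm"><h4>Link</h4></a>
    Address: 456 2st St., NYC 10001<br />
</p>
<!-- Comment 2 -->
<p>outside</p>
</body></html>"""


def paragraphs_between(soup, start_text, stop_text):
    """Collect the <p> children of <body> that sit between two comments.

    Hypothetical helper: only direct children of <body> are examined.
    """
    collecting = False
    found = []
    for child in soup.find('body').children:
        if isinstance(child, Comment):
            text = child.strip()  # comment text carries surrounding spaces
            if text == start_text:
                collecting = True   # "start processing"
            elif text == stop_text:
                break               # "stop processing"
        elif collecting and isinstance(child, Tag) and child.name == 'p':
            found.append(child)
    return found


soup = BeautifulSoup(HTML, 'html.parser')
paragraphs = paragraphs_between(soup, 'Comment 1', 'Comment 2')
print(len(paragraphs))
```

Any other comments that appear between the two markers are simply ignored, because only the two named comment texts change the state of the loop.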

Tedious, but not difficult.
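For the "descend one level" step, a sketch along these lines might work for the sample markup. It assumes that every listing follows the layout shown in the question (an "Address:" line, then a "Schools:" line followed by the school names); the function name `parse_listing` and the dictionary keys are invented for this example.

```python
from bs4 import BeautifulSoup, NavigableString

SNIPPET = """<p>
<a href="http://www.something.htm"><h4>Link</h4></a>
    Address: 123 1st St., NYC 10001<br />
    Schools:<br />
    School Name 1<br />
    School Name 2<br />
    School Name 3<br />
</p>"""


def parse_listing(p_tag):
    """Return the link, address and school names found in one <p> tag.

    Hypothetical helper; relies on the "Address:" / "Schools:" layout.
    """
    link = p_tag.a['href'] if p_tag.a else None
    # Keep only text nodes that are direct children of <p>;
    # the text inside <a> is nested deeper and is therefore skipped.
    lines = [str(node).strip() for node in p_tag.children
             if isinstance(node, NavigableString)]
    lines = [line for line in lines if line]

    address = None
    schools = []
    in_schools = False
    for line in lines:
        if line.startswith('Address:'):
            address = line[len('Address:'):].strip()
        elif line == 'Schools:':
            in_schools = True   # everything after this line is a school name
        elif in_schools:
            schools.append(line)
    return {'link': link, 'address': address, 'schools': schools}


p_tag = BeautifulSoup(SNIPPET, 'html.parser').p
listing = parse_listing(p_tag)
print(listing)
```

Calling `parse_listing` on each <p> collected between the two comments would give one dictionary per listing.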

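Personally, I would use Scrapy for this, because it would be easier to grab these tags through their div tags, or if someone set an id or class on the paragraph tags. Here is an example of some code I wrote for scraping a quotes website. I hope this helps.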

Install: pip install scrapy

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'Quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Selectors are kept as posted; adjust them to the markup of the target page.
        for quote in response.css('div.caption'):
            yield {
                'text': quote.css('a.title::text').extract_first(),
                # CSS selector syntax, so it is queried with .css() rather than .xpath()
                'author': quote.css('div.snippet::text').extract_first(),
            }
        # Follow the pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
