How do I get the actual text from a Beautiful Soup class tag?


  • Python version: 3.8
  • BS4 library

I have the following HTML, which shows 2 of the 20+ reviews I scraped. For space reasons I have not included the rest here, but you can imagine these blocks repeating.

I need to retrieve the class value "sml-rank-stars sml-str40 star" (shown on the second line here) from each review.

<div class="review-rank">
<span class="sml-rank-stars sml-str40 star"></span>
<span class="score">
<span class="item">
口味:3.5
</span>
<span class="item">
环境:4.0
</span>
<span class="item">
服务:3.5
</span>
<span class="item">人均:200元</span>
</span>
</div>
<div class="review-rank">
<span class="sml-rank-stars sml-str35 star"></span>
<span class="score">
<span class="item">
口味:3.0
</span>
<span class="item">
环境:4.5
</span>
<span class="item">
服务:3.0
</span>
</span>
</div>

This is what I have tried so far:

for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')
    star_rank = []
    for review in review_rank.find_all('span')[:1]:
        star_rank.append(review.get('class'))
print(star_rank)

I get this output:

[['sml-rank-stars', 'sml-str5', 'star']]

I can then use this code to get just the number:

star_rank[0][1][7:]

Output:

'5'

The problem is that I only get one of the reviews. I need this value for every review, stored in a list.

My desired output looks like this, or anything else I can iterate over to get the star count of each review:

[['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str5', 'star']]

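I have figured out how to print the results like this with the following code, but I need to save them to a list or something else I can iterate over.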

for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')
    for review in review_rank.find_all('span')[:1]:
        print(review.get('class'))

Output:

['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str5', 'star']

To iterate over all the reviews, select every .review-rank element. To get the rating itself, use a list comprehension:

star_rank = []
for r in soup.select('.review-rank'):
    # r.span is the first <span> in the block; keep only the digits of its 'sml-strNN' class
    star_rank.append([s.replace('sml-str', '') for s in r.span['class'] if 'sml-str' in s][0])
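The comprehension above raises an IndexError when a span has no 'sml-str' class. Below is a minimal sketch of a more defensive way to pull the number out of a class list, assuming the class names follow the 'sml-strNN' pattern shown in the question. The helper name star_code is hypothetical and is not part of Beautiful Soup.

```python
import re

# Hypothetical helper (not a Beautiful Soup API): matches class names such as 'sml-str40'.
_STAR_CLASS = re.compile(r'^sml-str(\d+)$')


def star_code(classes):
    """Return the digits of the first 'sml-strNN' class as an int, or None if absent."""
    for name in classes or []:
        match = _STAR_CLASS.match(name)
        if match:
            return int(match.group(1))
    return None


print(star_code(['sml-rank-stars', 'sml-str40', 'star']))  # 40
print(star_code(['sml-rank-stars', 'star']))               # None
```

Inside the loop this could be used as star_rank.append(star_code(r.span['class'])). Whether 40 should be read as 4.0 stars depends on the site, so the sketch leaves the number unscaled.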

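Or, following your example more closely, since I do not know the overall structure above review_items or whether each review contains one or several .review-rank blocks: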

star_rank = []
for review in review_items.find_all('div', class_='main-review'):
    # inner loop variable renamed so it no longer shadows the outer 'review'
    for rank in review.find_all('div', class_='review-rank'):
        star_rank.append([s.replace('sml-str', '') for s in rank.span['class'] if 'sml-str' in s][0])

Output:

['40', '35']
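If you want the full class lists from the question rather than just the digits, the same loop works as long as the list is created once, before the loop. In the original attempt star_rank = [] sat inside the loop, so it was reset on every review and only the last one survived. Here is a self-contained sketch, assuming a trimmed copy of the HTML from the question and the built-in 'html.parser' (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two review blocks from the question.
html = """
<div class="review-rank">
  <span class="sml-rank-stars sml-str40 star"></span>
  <span class="score"><span class="item">口味:3.5</span></span>
</div>
<div class="review-rank">
  <span class="sml-rank-stars sml-str35 star"></span>
  <span class="score"><span class="item">口味:3.0</span></span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

star_rank = []  # created once, so results from every review accumulate
for rank in soup.select('.review-rank'):
    star = rank.select_one('span.sml-rank-stars')
    if star is None:
        continue  # skip blocks that have no star span
    star_rank.append(star['class'])  # 'class' is multi-valued, so this is a list

print(star_rank)
# [['sml-rank-stars', 'sml-str40', 'star'], ['sml-rank-stars', 'sml-str35', 'star']]
```

Selecting span.sml-rank-stars directly avoids relying on the star span being the first child, which find_all('span')[:1] and r.span both assume.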
