Python: Counting specific words in HTML



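I'm new to Python and finding web scraping very hard to learn. I want to count the words on this HTML page, show which words appear only once, and show how many times the word "ladies" appears. So far I've managed to come up with this: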

import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter

def my_start(url):
    my_wordlist = []
    my_source_code = requests.get(url).text
    my_soup = BeautifulSoup(my_source_code, 'html.parser')
    for each_text in my_soup.findAll('p', {'class':'about-text'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            my_wordlist.append(each_word)
        clean_wordlist(my_wordlist)

def clean_wordlist(wordlist):
    clean_list =[]
    for word in wordlist:
        symbols = '!@#$%^&*()_-+={[}]|;:"<>?/., '
        for i in range (0, len(symbols)):
            word = word.replace(symbols[i], '')
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)

def create_dictionary(clean_list):
    word_count = {}
    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    c = Counter(word_count)
    print(c)
    if word_count[word] == 1:
        print(word)
    top = soup.find_all("ladies")
    print(top)

if __name__ == '__main__':
    my_start("http://brasil.pyladies.com/about/")

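I've noticed that some words that appear only once are not shown, while one word that appears twice is shown. I also don't know how to count how many times the word "ladies" appears. Any input on this would be much appreciated!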

I suggest using a regular expression (regex) to solve this:

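import re
import requests

url = "http://brasil.pyladies.com/about/"  # the page from the question
my_source_code = requests.get(url).text
pattern = "ladies"
ladies_count = len(re.findall(pattern, my_source_code))
print(ladies_count)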

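This is a quick way to count occurrences of a word in text. Note that it searches the raw HTML and is case-sensitive, so it also matches "ladies" inside longer words such as "pyladies" and inside tags or URLs, and it misses "Ladies".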
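A plain pattern such as "ladies" matches any substring. If only the standalone word should count, word boundaries and a case-insensitive flag may help. Here is a minimal sketch; sample_text is a made-up string for illustration, not the content of the real page:

```python
import re

# Hypothetical sample text standing in for the page content.
sample_text = "PyLadies is for ladies. Ladies who code, ladies who teach."

# Substring search, ignoring case: also counts the "Ladies" inside "PyLadies".
substring_count = len(re.findall("ladies", sample_text, flags=re.IGNORECASE))

# \b marks a word boundary, so only the standalone word is counted.
whole_word_count = len(re.findall(r"\bladies\b", sample_text, flags=re.IGNORECASE))

print(substring_count)   # 4
print(whole_word_count)  # 3
```

Which of the two counts is the right one depends on whether "PyLadies" should count as an occurrence of "ladies".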

top = soup.find_all("ladies")

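Here, find_all is used incorrectly. It searches for HTML tags, not words, so find_all("ladies") looks for <ladies> elements. Also, no variable named soup is defined in the code; the parsed page is called my_soup and is local to my_start.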
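To illustrate the difference, here is a small sketch using BeautifulSoup on a made-up HTML fragment (sample_html is hypothetical, not taken from the real page). It contrasts a tag-name search with a text search through the string argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration.
sample_html = '<p class="about-text">Ladies who code.</p><p>More ladies here.</p>'
soup = BeautifulSoup(sample_html, "html.parser")

# A plain string argument is treated as a tag name: this looks for <ladies> elements.
no_tags = soup.find_all("ladies")

# Searching for a tag that actually exists returns the matching elements.
paragraphs = soup.find_all("p")

# To search text nodes instead, pass a compiled pattern via the string argument.
# This returns the text nodes that contain a match, not the number of matches.
text_hits = soup.find_all(string=re.compile("ladies", re.IGNORECASE))

print(no_tags)          # []
print(len(paragraphs))  # 2
print(len(text_hits))   # 2
```

Because the text search returns whole text nodes, it is better suited to locating passages than to counting how often a word occurs.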

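If you want to print how many times the word "ladies" appears, try this inside create_dictionary, where word_count is defined:

print(word_count.get('ladies', 0))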
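Putting the pieces together, the counting step can be sketched with collections.Counter alone. This is only one possible approach: count_words and sample_text are hypothetical names introduced here, and the text would normally come from the parsed page rather than a literal string.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter of the lowercase words found in text."""
    # \w+ keeps letters, digits and underscores together and drops punctuation.
    return Counter(re.findall(r"\w+", text.lower()))

# Hypothetical sample text standing in for the page content.
sample_text = "Ladies code. Ladies teach. We welcome all ladies and friends."
word_count = count_words(sample_text)

# Words that occur exactly once, in order of first appearance.
once = [word for word, n in word_count.items() if n == 1]

print(once)                  # ['code', 'teach', 'we', 'welcome', 'all', 'and', 'friends']
print(word_count["ladies"])  # 3
```

A Counter returns 0 for a word it has never seen, so looking up a missing word does not raise a KeyError. Checking every entry in the comprehension, rather than only the last word of the loop, is what makes all single-occurrence words show up.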
