Split a list created by re.findall into individual words, then count each word's occurrences in descending order



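I have to count how often each word occurs across all elements of a list created by re.findall.

For example: jobs = ["Java Developer", "Data Scientist", "Business Architect Process Mining", "JavaScript Developer"]

jobs_split = ["Java", "Developer", "Data", "Scientist", "Business", "Architect", "Process", "Mining", "JavaScript", "Developer"]

Then I need to count the occurrences of each word and write them to a file as word: count.

I know I can use Python's built-in Counter, but I don't know how to split every element of the list.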

import re
from collections import Counter
from urllib.request import urlopen, Request

jobs = []
jobs_split = []
for i in range(10):
    html = Request("https://mysite?pn={}".format(i), headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(html).read().decode('utf-8')
    jobs += re.findall(r'"@type":"JobPosting","title":"([A-Za-z0-9 -/]+)","description"', page)
my_set = set(jobs)
# print(Counter(my_set))
print(my_set)

You can use itertools.chain to join the words from every title into a single iterable:

from collections import Counter
from itertools import chain
jobs = ["Java Developer","Data Scientist","Business Architect Process Mining","JavaScript Developer"]
tokens = chain.from_iterable(job.split() for job in jobs)
counts = Counter(tokens)
print(counts)

Output

Counter({'Developer': 2, 'JavaScript': 1, 'Architect': 1, 'Process': 1, 'Mining': 1, 'Business': 1, 'Scientist': 1, 'Java': 1, 'Data': 1})
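The question also asks for the counts in descending order, written out as word: count. A minimal sketch of that step, assuming Counter.most_common() gives an acceptable ordering (ties keep the order in which words were first seen on Python 3.7+). The helper names format_counts and write_counts are hypothetical, not part of the original answer:

```python
from collections import Counter
from itertools import chain

jobs = ["Java Developer", "Data Scientist",
        "Business Architect Process Mining", "JavaScript Developer"]
counts = Counter(chain.from_iterable(job.split() for job in jobs))


def format_counts(counter):
    """Return 'word: count' strings, most frequent word first."""
    # most_common() with no argument returns every (word, count) pair,
    # sorted by count in descending order.
    return [f"{word}: {n}" for word, n in counter.most_common()]


def write_counts(counter, path):
    """Write one 'word: count' line per word to the file at `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for line in format_counts(counter):
            f.write(line + "\n")


lines = format_counts(counts)
print("\n".join(lines))
```

Calling write_counts(counts, "word_counts.txt") would then produce the file; the file name is only an example.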

It is as simple as calling .split(), which with no arguments splits on whitespace.

You do have to iterate over your list, though:

jobs = ["Java Developer","Data Scientist","Business Architect Process Mining","JavaScript Developer"]
split = [ job.split() for job in jobs ]
jobs_split = [item for sublist in split for item in sublist]
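The flattened list can then be passed straight to Counter. A short sketch that folds the two comprehensions above into one and counts the result; the variable name counts is an assumption, since the answer stops at jobs_split:

```python
from collections import Counter

jobs = ["Java Developer", "Data Scientist",
        "Business Architect Process Mining", "JavaScript Developer"]

# Split every title and flatten in a single comprehension.
jobs_split = [word for job in jobs for word in job.split()]

# Counter accepts any iterable of hashable items, including a list of words.
counts = Counter(jobs_split)

# Print the words from most to least frequent.
for word, n in counts.most_common():
    print(f"{word}: {n}")
```

This does not need itertools and gives the same counts as the chain-based approach.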
