Counting words in different sections of a text file



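For a project, I have to analyze a txt file containing more than 200 resumes with Python. I have to search through the file and count whether a specific keyword is mentioned. This is my very simple code: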

with open("CVC.txt") as file:
    data = file.read()
occurrence = data.count("Biology")
print('Number of occurrences of the word:', occurrence)

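The problem is that when I search for, e.g., "Engineering", it is mentioned several times within a single resume, but I only want to count it once. Every resume starts with the word "Contact". My question is how I can write an algorithm that tells the resumes apart and counts a given keyword only once per resume.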

Thanks in advance!


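The logic is fairly simple. Parse the file line by line; when you see a line that starts a contact, collect every line after it until you see the next contact line. Once the file has been read completely, store the remaining lines as part of the last contact that was started.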

contacts = []
current_contact = None
with open("CVC.txt") as data:
    for line in data:  # iterate over the file line by line
        # skip page lines (e.g. in the middle of a contact)
        if line.strip().startswith("Page "):
            continue
        # start a new contact
        if line.strip() == "Contact":
            if current_contact is not None:
                # store the current contact lines, if they exist
                contacts.append('\n'.join(current_contact))
            current_contact = []
            continue
        # collect all lines for a single contact
        if current_contact is not None:
            current_contact.append(line.rstrip())
        else:
            print(f"Not seen 'Contact' yet... '{line.rstrip()}'")  # for debugging, e.g. start of the file
# store remaining data after all lines are read
if current_contact:
    contacts.append('\n'.join(current_contact))
del current_contact

I created a sample file like this:

Contact
https://linkedin.com/1
Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus
Page 1 of 2
Hic dignissimos consequatur error.
Contact
https://linkedin.com/2
Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.

This test outputs:

>>> for c in contacts:
...   print(c.splitlines())
... 
['', 'https://linkedin.com/1', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus', '', '', 'Hic dignissimos consequatur error.']
['', 'https://linkedin.com/2', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.']

To count the occurrences of a word within a single contact, you can access that contact by position:

contacts[0].count("Biology")
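
To get the number the question actually asks for (how many resumes mention a keyword at least once), the per-contact strings can be reduced to a single count. A minimal sketch, assuming `contacts` is a list of strings like the one built above; the helper name `count_contacts_with_keyword`, the sample data, and the case-insensitive matching are my own choices for illustration, not part of the original code:

```python
def count_contacts_with_keyword(contacts, keyword):
    """Return how many contact blocks mention keyword at least once."""
    needle = keyword.lower()
    # each contact contributes at most 1, however often the keyword appears
    return sum(1 for contact in contacts if needle in contact.lower())


# hypothetical sample data standing in for the parsed resumes
sample_contacts = [
    "Biology degree\nTaught biology for two years",
    "Mechanical engineering",
    "Studied Biology and chemistry",
]
print(count_contacts_with_keyword(sample_contacts, "Biology"))  # prints 2
```

With the real data you would pass the `contacts` list instead of `sample_contacts`.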

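Here is a solution with simpler logic: create a flag that tells us (1) whether we are inside a contact and (2) whether we have already seen the word in this contact.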

counter = 0
is_counted = True  # Initialize the flag to avoid the code breaking
word = 'engineering'  # Change this
with open('cv.txt', 'r') as file:
    line = file.readline()
    while line:
        if "contact" in line.lower():
            is_counted = False
        elif not is_counted and word in line.lower():
            counter += 1
            is_counted = True
        line = file.readline()
print(counter)

I tried it successfully on a small sample; try it on your input and see if it works.
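
One caveat with plain substring tests: `word in line.lower()` also matches inside longer words (for example "engineering" inside "bioengineering"), and `"contact" in line.lower()` treats any line that merely mentions "contact" as the start of a new resume. Below is a hedged sketch of a stricter variant of the same flag idea. It assumes a resume starts on a line consisting only of the word "Contact"; the function name `count_resumes_with_word` is hypothetical, and whole-word matching via `re` is a swapped-in technique, not what the code above does:

```python
import re


def count_resumes_with_word(lines, word):
    """Count resumes (blocks starting with a 'Contact' line) containing word."""
    # \b anchors make this a whole-word, case-insensitive match
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counter = 0
    is_counted = True  # nothing is counted before the first 'Contact' line
    for line in lines:
        if line.strip().lower() == "contact":
            is_counted = False  # a new resume starts here
        elif not is_counted and pattern.search(line):
            counter += 1
            is_counted = True  # ignore further hits in this resume
    return counter


# usage with a file (the file name is taken from the question):
# with open("CVC.txt") as file:
#     print(count_resumes_with_word(file, "engineering"))
```

Because the function accepts any iterable of lines, it can be checked on a small list of strings before running it on the full file.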
