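I'm working on natural language processing and need to preprocess some data. My data is in a text file, and I have to read it and change every name to Male or Female.
After reading and tokenizing the data, I apply POS tagging, check each word against files containing lists of names, and change the name to 'Male' or 'Female'.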
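For example:
['Jack', 'and', 'Jill', 'went', 'up', 'the', 'hill']
should be changed to
['Male', 'and', 'Female', 'went', 'up', 'the', 'hill']
based on the POS tags below:
[('Jack', 'NNP'), ('and', 'CC'), ('Jill', 'NNP'), ('went', 'NNP'), ('up', 'IN'), ('the', 'DT'), ('hill', 'NN')]
My code is as follows: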
import nltk
text = open('collegegirl.txt').read()
with open('male_names.txt') as f1:
    male = nltk.word_tokenize(f1.read())
with open('female_names.txt') as f2:
    female = nltk.word_tokenize(f2.read())
data = nltk.pos_tag(nltk.word_tokenize(text))
for word, pos in data:
    if(pos == 'NNP'):
        if word in male:
            word = 'Male'
        if word in female:
            word = 'Female'
The code above only checks the words; it doesn't write anything back. How can I edit the names in data? I'm new to Python. Thanks in advance.
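In my personal opinion, it would be better to use spaCy for POS tagging; in my experience it is faster and more accurate. In addition, you can use its named entity recognition to check whether a word is a PERSON. Install spaCy and download the en_core_web_lg model from here: https://spacy.io/usage/
Your problem can be solved as follows: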
import spacy
from functools import reduce
# Load the large English model (it has to be downloaded first)
nlp_spacy = spacy.load('en_core_web_lg')
# Map each known name to the label that should replace it
NAMELIST = {'Christiano Ronaldo':'Male', 'Neymar':'Male', 'Messi':'Male', "Sandra":'Female'}
with open("input.txt") as f:
    text = f.read()
doc = nlp_spacy(text)
# Keep only entities tagged PERSON whose text is a known name
names_in_text = [(entity.text, NAMELIST[entity.text]) for entity in doc.ents if entity.label_ == 'PERSON' and entity.text in NAMELIST]
print(names_in_text)  # prints [('Christiano Ronaldo', 'Male'), ('Messi', 'Male')]
# Apply each (name, label) replacement to the text in turn
replaced_text = reduce(lambda x, kv: x.replace(*kv), names_in_text, text)
print(replaced_text)  # prints Male scored three. Male scored one. Female is an athlete. I am from US.
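One caveat with `str.replace` is that it rewrites every occurrence of the string, including occurrences inside longer words that were never tagged as a person. A minimal sketch of an alternative, assuming you collect `(start, end, label)` triples from each entity's `start_char` and `end_char` attributes. The helper name `replace_spans` and the hard-coded offsets are made up for illustration, so the snippet runs without spaCy:

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label.

    The spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(label)               # the replacement label
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last span
    return "".join(pieces)


text = "Messi scored one. Messina is a city."
# Hypothetical offsets; with spaCy they could come from
# (ent.start_char, ent.end_char, NAMELIST[ent.text]) for each matching entity.
spans = [(0, 5, "Male")]
print(replace_spans(text, spans))  # Male scored one. Messina is a city.
```

By comparison, `text.replace("Messi", "Male")` would also turn "Messina" into "Malena".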
Split the text into tokens and make the replacement inside a for loop:
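for i, (word, pos) in enumerate(data):
    if pos == 'NNP':
        # Assign through the index so the list itself is updated
        if word in male:
            data[i] = ('Male', pos)
        if word in female:
            data[i] = ('Female', pos)
# Drop the POS tags and keep only the (possibly replaced) tokens
array = [text for (text, pos) in data]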
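The loop in the question has no effect because `word = 'Male'` only rebinds the loop variable; the tuple stored in `data` is left alone. A small sketch of the difference, using a hand-written tagged list and tiny name sets that are made up for illustration (not real `nltk.pos_tag` output):

```python
data = [("Jack", "NNP"), ("and", "CC"), ("Jill", "NNP")]
male, female = {"Jack"}, {"Jill"}

# Rebinding the loop variable leaves the list as it was.
for word, pos in data:
    if pos == "NNP" and word in male:
        word = "Male"
unchanged = [w for w, _ in data]
print(unchanged)  # ['Jack', 'and', 'Jill']

# Assigning through the index changes the list itself.
for i, (word, pos) in enumerate(data):
    if pos == "NNP":
        if word in male:
            data[i] = ("Male", pos)
        elif word in female:
            data[i] = ("Female", pos)
changed = [w for w, _ in data]
print(changed)  # ['Male', 'and', 'Female']
```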
A more Pythonic way to do this:
array = [
    "Female" if pos == "NNP" and x in female
    else "Male" if pos == "NNP" and x in male
    else x
    for x, pos in data
]
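If the name files hold one name per line, it may be simpler to load them into sets with `splitlines()` and look names up in a single dictionary, since membership tests on a list get slow for long name lists. A sketch under that assumption; `load_names` and `genderize` are hypothetical helper names, and the name lists are inline strings so the snippet runs on its own:

```python
def load_names(raw):
    """Turn the contents of a one-name-per-line file into a set of names."""
    return {line.strip() for line in raw.splitlines() if line.strip()}


def genderize(tagged, male, female):
    """Return the tokens, with proper nouns found in either set replaced."""
    # Build one lookup table. A name present in both sets ends up "Female",
    # because the female entries are written last.
    labels = {name: "Male" for name in male}
    labels.update({name: "Female" for name in female})
    return [labels.get(word, word) if pos == "NNP" else word
            for word, pos in tagged]


male = load_names("Jack\nJohn\n")
female = load_names("Jill\nMary\n")
# Hand-written tags for illustration, not real nltk.pos_tag output
tagged = [("Jack", "NNP"), ("and", "CC"), ("Jill", "NNP"), ("went", "VBD")]
print(genderize(tagged, male, female))  # ['Male', 'and', 'Female', 'went']
```

With real files, `load_names(open('male_names.txt').read())` would produce the set, assuming that file layout.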