I currently want to run spaCy NER over all the text files in my directory and output "number of named entities / total number of words in the text" for each file. I don't know how to automate this. At the moment I use:
import pandas as pd
from collections import defaultdict
from pathlib import Path

def read_txt_files(PATH: str):
    results = defaultdict(list)
    for file in Path(PATH).iterdir():
        with open(file, "rt", newline='', encoding="utf8") as file_open:
            results["file_num"].append(file.name)
            results["text"].append(file_open.read().replace('\n', " "))
    df = pd.DataFrame(results)
    return df
from tqdm import tqdm

def Specificity(input_data: pd.Series):
    # ner is the loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
    specificity = [0] * len(input_data)
    for i in tqdm(range(len(input_data)), desc='Get the Specificity'):
        specificity[i] = len(ner(input_data[i]).ents) / len(input_data[i])
        # [len(ner(data[i]).ents)/len(data[i]) for i in tqdm(range(len(data)))]
    return specificity
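Note that the denominators differ between the two versions: `Specificity` divides by `len(input_data[i])`, which is the number of *characters* in the string, while the single-file code below divides by the number of non-punctuation *tokens*. A token-based variant could look like this sketch (`specificity_per_doc` is a hypothetical helper name; it assumes the same `nlp`/`ner` pipeline is loaded):

```python
def specificity_per_doc(doc):
    # Entities divided by non-punctuation tokens, matching the
    # single-file calculation; guard against empty documents.
    num_words = len([token for token in doc if not token.is_punct])
    return len(doc.ents) / num_words if num_words else 0.0

# usage with the existing pipeline:
# specificity = [specificity_per_doc(nlp(text)) for text in input_data]
```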
But somehow it just shows the wrong specificity values, far lower than they should be.
When I run NER on a single text file, it looks like this:
import spacy

nlp = spacy.load("en_core_web_sm")
text = open(r"mydirectory", 'r', encoding='utf-8').read()
parsed_text = nlp(text)
named_entities = parsed_text.ents
num_words = len([token for token in parsed_text if not token.is_punct])
num_entities = len(named_entities)
specificity_score = num_entities / num_words
Is there a way to combine the two, so that the batch code computes the same specificity measure as the single-file version?
Since Pandas and SpaCy already provide convenient functionality, I tried to implement your solution in a cleaner format. The obvious problem is that nlp(text) is evaluated twice, which is wasteful!
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
df = pd.DataFrame({'text':['This is a test from USA', 'Ronaldo missed the WC']})
get_spec = lambda text:len(nlp(text).ents)/len([token for token in nlp(text) if not token.is_punct])
df['spec'] = df['text'].apply(get_spec)
print(df)
Output:
text spec
0 This is a test from USA 0.166667
1 Ronaldo missed the WC 0.250000
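The double `nlp(text)` call noted above can be avoided by running the pipeline once per text with `nlp.pipe`, which spaCy provides for batch processing. A sketch under that assumption (`add_spec` is a hypothetical helper name; `nlp` is the loaded `en_core_web_sm` pipeline):

```python
import pandas as pd

def add_spec(df, nlp):
    # Run the pipeline once per text via nlp.pipe and reuse each Doc
    # for both the entity count and the token count.
    specs = []
    for doc in nlp.pipe(df['text']):
        words = [token for token in doc if not token.is_punct]
        specs.append(len(doc.ents) / len(words) if words else 0.0)
    df['spec'] = specs
    return df
```

Compared with the `apply`/lambda version, each text is tokenized and tagged only once, which matters when the pipeline runs over many or long documents.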