I currently want to run spaCy NER over all the text files in my directory and output "number of named entities / total number of words in the text" for each file. I don't know how to automate this. At the moment I use:
import pandas as pd
from collections import defaultdict
from pathlib import Path

def read_txt_files(PATH: str):
    results = defaultdict(list)
    for file in Path(PATH).iterdir():
        with open(file, "rt", newline='', encoding="utf8") as file_open:
            results["file_num"].append(file.name)
            results["text"].append(file_open.read().replace('\n', " "))
    df = pd.DataFrame(results)
    return df
from tqdm import tqdm

def Specificity(input_data: pd.Series):
    # ner is the loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm")
    specificity = [0] * len(input_data)
    for i in tqdm(range(len(input_data)), desc='Get the Specificity'):
        specificity[i] = len(ner(input_data[i]).ents) / len(input_data[i])
        # [len(ner(data[i]).ents)/len(data[i]) for i in tqdm(range(len(data)))]
    return specificity
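Note that the denominators differ between the two versions: `Specificity` divides by `len(input_data[i])`, which is the number of *characters* in the string, while the single-file code below divides by the number of non-punctuation *tokens*. A token-based variant could look like this sketch (`specificity_per_doc` is a hypothetical helper name; it assumes the same `nlp`/`ner` pipeline is loaded):

```python
def specificity_per_doc(doc):
    # Entities divided by non-punctuation tokens, matching the
    # single-file calculation; guard against empty documents.
    num_words = len([token for token in doc if not token.is_punct])
    return len(doc.ents) / num_words if num_words else 0.0

# usage with the existing pipeline:
# specificity = [specificity_per_doc(nlp(text)) for text in input_data]
```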
But somehow it just shows the wrong specificity values, far lower than they should be.
When I run NER on a single text file, it looks like this:
import spacy

nlp = spacy.load("en_core_web_sm")
text = open(r"mydirectory", 'r', encoding='utf-8').read()
parsed_text = nlp(text)
named_entities = parsed_text.ents
num_words = len([token for token in parsed_text if not token.is_punct])
num_entities = len(named_entities)
specificity_score = num_entities / num_words
Is there a way to combine the two, so that the batch code computes the same specificity measure as the single-file version?
Since Pandas and SpaCy already provide convenient functionality, I tried to implement your solution in a cleaner format. The obvious problem is that nlp(text) is evaluated twice, which is wasteful!
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
df = pd.DataFrame({'text':['This is a test from USA', 'Ronaldo missed the WC']})
get_spec = lambda text:len(nlp(text).ents)/len([token for token in nlp(text) if not token.is_punct])
df['spec'] = df['text'].apply(get_spec)
print(df)
Output:
text spec
0 This is a test from USA 0.166667
1 Ronaldo missed the WC 0.250000
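The double `nlp(text)` call noted above can be avoided by running the pipeline once per text with `nlp.pipe`, which spaCy provides for batch processing. A sketch under that assumption (`add_spec` is a hypothetical helper name; `nlp` is the loaded `en_core_web_sm` pipeline):

```python
import pandas as pd

def add_spec(df, nlp):
    # Run the pipeline once per text via nlp.pipe and reuse each Doc
    # for both the entity count and the token count.
    specs = []
    for doc in nlp.pipe(df['text']):
        words = [token for token in doc if not token.is_punct]
        specs.append(len(doc.ents) / len(words) if words else 0.0)
    df['spec'] = specs
    return df
```

Compared with the `apply`/lambda version, each text is tokenized and tagged only once, which matters when the pipeline runs over many or long documents.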