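I followed the answer to this question (Extracting entities from a dataframe using spaCy), which solved my problem of iterating over the DF.
The problem I'm facing now is taking those results, adding a column from the original df, and putting all of it into a new df. I want the DOI from the original df, plus the entity text and entity label from the NER.
Code to extract the entities and put them in a list: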
import pandas as pd
import spacy

entities = []
nlp = spacy.load("en_ner_bionlp13cg_md")
for i in df['Abstract'].tolist():
    doc = nlp(i)
    for entity in doc.ents:
        entities.append((df.DOI, entity.text, entity.label_))
I then take the list of entities and load it into a new df:
df_ner = pd.DataFrame.from_records(entities, columns =['DOI', 'ent_name', 'ent_type'])
Unfortunately, only the first record gets loaded into the df. What am I missing?
   DOI                                  ent_name                          ent_type
0  3 10.7501/j.issn.0253-2670.2020....  COVID-19                          GENE_OR_GENE_PRODUCT
1  3 10.7501/j.issn.0253-2670.2020....  ACE2                              GENE_OR_GENE_PRODUCT
2  3 10.7501/j.issn.0253-2670.2020....  angiotensin converting enzyme II  GENE_OR_GENE_PRODUCT
3  3 10.7501/j.issn.0253-2670.2020....  ACE2                              GENE_OR_GENE_PRODUCT
4  3 10.7501/j.issn.0253-2670.2020....  UniProt                           GENE_OR_GENE_PRODUCT
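A likely cause, sketched with a made-up two-row frame (the DOIs below are placeholders, not values from the original data): inside the loop, `df.DOI` refers to the whole column rather than the DOI of the abstract being processed, so every tuple receives the same Series. That would explain why each row of the output shows the same index and DOI.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
df = pd.DataFrame({
    "DOI": ["10.1000/example.1", "10.1000/example.2"],
    "Abstract": ["first abstract", "second abstract"],
})

rows = []
for abstract in df["Abstract"].tolist():
    # df.DOI is the entire column, not the DOI that belongs to `abstract`.
    rows.append((df.DOI, abstract))

# Every tuple holds a full pandas Series in the DOI slot.
column_in_every_row = all(isinstance(doi, pd.Series) for doi, _ in rows)
print(column_in_every_row)
print(len(rows[0][0]))  # length of the whole column, not of a single DOI string
```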
This worked:
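import spacy
from colorama import Fore
from tqdm import tqdm

getattr(tqdm, '_instances', {}).clear()  # clear any stale progress bars
spacy.prefer_gpu()
nlp = spacy.load("en_ner_bionlp13cg_md")
entities = []
# str(df.uuid) is the string form of the entire uuid column, so every tuple
# carries the same value; the abstract text `i` is what identifies the row.
uuid = str(df.uuid)
for i in tqdm(df['Abstract'].tolist(), bar_format="{l_bar}%s{bar}%s{r_bar}" % (Fore.GREEN, Fore.RESET)):
    doc = nlp(i)
    for entity in doc.ents:
        entities.append((i, entity.text, entity.label_, uuid))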
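One way to keep each entity tied to the row it came from is to iterate over the identifier column and the text column together. The sketch below is only an illustration under a few assumptions: `extract_entities` and `fake_nlp` are hypothetical names, the DOIs are placeholders, and `fake_nlp` is a stand-in that mimics just the `.ents`, `.text` and `.label_` attributes so the example runs without a model installed. In real use you would pass the loaded spaCy pipeline (e.g. `spacy.load("en_ner_bionlp13cg_md")`) as `nlp`.

```python
from types import SimpleNamespace

import pandas as pd


def extract_entities(df, nlp, id_col="DOI", text_col="Abstract"):
    """Return one row per entity, tagged with the identifier of its source row."""
    records = []
    # zip pairs each identifier with the text from the same row.
    for row_id, text in zip(df[id_col], df[text_col]):
        doc = nlp(text)
        for entity in doc.ents:
            records.append((row_id, entity.text, entity.label_))
    return pd.DataFrame.from_records(
        records, columns=[id_col, "ent_name", "ent_type"]
    )


def fake_nlp(text):
    """Stand-in for a spaCy pipeline: treats upper-case tokens as entities."""
    ents = [
        SimpleNamespace(text=word, label_="GENE_OR_GENE_PRODUCT")
        for word in text.split()
        if word.isupper()
    ]
    return SimpleNamespace(ents=ents)


# Hypothetical data; the DOIs are placeholders.
df = pd.DataFrame({
    "DOI": ["10.1000/example.1", "10.1000/example.2"],
    "Abstract": ["ACE2 binds the spike protein", "TMPRSS2 and ACE2 are co-expressed"],
})

df_ner = extract_entities(df, fake_nlp)
print(df_ner)
```

The same function should work with `id_col="uuid"` if the frame uses a uuid column instead of a DOI. For large frames, spaCy's `nlp.pipe(..., as_tuples=True)` is another option for carrying an identifier alongside each text while batching.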