Parsing LaTeX author tags to extract author names



1. Author tags:

author{{small Tanya Araujo$^{a,b}$ and Elsa Fontainha$^{a}$} and {small $^{a}$ISEG
(Lisbon School of Economics & Management) Universidade de Lisboa, } and
{small Rua do Quelhas, 6 1200-781 Lisboa Portugal} and {small $^{b}$Research
Unit on Complexity and Economics (UECE)} and {small Rua Miguel Lupi, 20
1249-078 Lisboa Portugal}}
author{{bf R. Vilela Mendes} and {small Grupo de Fisica Matematica, Av.
Gama Pinto 2,} and {small  1699 Lisboa Codex, Portugal
(vilela@cii.fc.ul.pt)} and {bf Tanya Araujo and Francisco Loucca%
} and {small Departamento de Economia, ISEG,} and {small R. Miguel Lupi
20, 1200 Lisboa, Portugal} and {small (tanya@iseg.utl.pt,
flouc@iseg.utl.pt)}}

2. After removing special characters, other tags, email addresses, and digits:

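Tanya Araujo and Elsa Fontainha ISEG Lisbon School of Economics & Management Universidade de Lisboa, Rua do Quelhas, - Lisboa Portugal Research Unit on Complexity and Economics UECE Rua Miguel Lupi, - Lisboa Portugal

R. Vilela Mendes Grupo de Fisica Matematica, Av. Gama Pinto , Lisboa Codex, Portugal Tanya Araujo and Francisco Loucca Departamento de Economia, ISEG, R. Miguel Lupi , Lisboa, Portugal ,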
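
For readers who want to reproduce the cleanup in step 2, here is a minimal sketch using Python's standard `re` module. The function name `strip_markup` and the exact patterns are assumptions made for illustration, not the code actually used to produce the text above. In particular, this sketch keeps the literal word "and" between groups, so its output will not match the text above exactly.

```python
import re

def strip_markup(author_field):
    """Hypothetical helper: drop emails, inline math, font switches, digits and stray symbols."""
    text = author_field.replace("%", " ")             # LaTeX comment signs
    text = re.sub(r"\S+@\S+", " ", text)              # email addresses
    text = re.sub(r"\$[^$]*\$", " ", text)            # inline math such as $^{a,b}$
    text = re.sub(r"\\?\b(?:small|bf)\b", " ", text)  # font switches, with or without a backslash
    text = re.sub(r"\d+", " ", text)                  # digits (street numbers, postal codes)
    text = re.sub(r"[{}()\\]", " ", text)             # braces, parentheses, backslashes
    return re.sub(r"\s+", " ", text).strip()          # collapse runs of whitespace

sample = (r"{small Tanya Araujo$^{a,b}$ and Elsa Fontainha$^{a}$} and "
          r"{small Rua do Quelhas, 6 1200-781 Lisboa Portugal}")
cleaned = strip_markup(sample)
print(cleaned)
```

The cleaned string still mixes names with addresses, which is why a separate step (NER or structure-aware parsing) is needed afterwards.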

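3. Expected output: extract only the names and drop university names and any location names. I tried NLTK's NER, but it tags the university and Lisboa as PERSON, among other errors.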

(PERSON Tanya/NNP)
(PERSON Araujo/NNP)
and/CC
(PERSON Elsa/NNP Fontainha/NNP)
ISEG/NNP
(/(
(ORGANIZATION Lisbon/NNP School/NNP)
of/IN
(ORGANIZATION Economics/NNP)
&/CC
Management/NNP
)/)
(PERSON Universidade/NNP)
de/FW
(PERSON Lisboa/NNP)
,/,
(PERSON Rua/NNP)
do/VBP
(PERSON Quelhas/NNP)
,/,
-/:
(PERSON Lisboa/NNP Portugal/NNP Research/NNP Unit/NNP)
on/IN
(ORGANIZATION Complexity/NNP)
and/CC
(GPE Economics/NNP)
(/(
(ORGANIZATION UECE/NNP)
)/)
(PERSON Rua/NNP Miguel/NNP Lupi/NNP)
,/,
-/:
(PERSON Lisboa/NNP Portugal/NNP Alessandro/NNP Spelta/NNP)
corresponding/VBG
author/NN
:/:
and/CC
(PERSON Tanya/NNP Araujo/NNP))

Can this be solved with NLTK's NER, or should we try another library such as spaCy?

You can use https://github.com/alvinwan/TexSoup, which will extract the author elements, as shown below.

>>> from TexSoup import TexSoup
>>> soup = TexSoup(open('tri7.txt').read())
>>> for i in soup.find_all('author'):
...     i
...     
author{{small Tanya Araujo$^{a,b}$ and Elsa Fontainha$^{a}$} and {small $^{a}$ISEG
(Lisbon School of Economics & Management) Universidade de Lisboa, } and
{small Rua do Quelhas, 6 1200-781 Lisboa Portugal} and {small $^{b}$Research
Unit on Complexity and Economics (UECE)} and {small Rua Miguel Lupi, 20
1249-078 Lisboa Portugal}}

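You can then extract the string, for example

{{small Tanya Araujo$^{a,b}$ and Elsa Fontainha$^{a}$}

in this case, in any of several ways. Finally, if you cannot get TexSoup to do this for you, you can use regular expressions to remove items such as small and $^{a,b}$.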
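
The regular-expression fallback could look something like the sketch below. It uses only the standard `re` module. The helper name `names_from_group` is made up for this example, and it assumes the group passed in contains only names joined by the word "and", as in the string above.

```python
import re

def names_from_group(group):
    """Hypothetical helper: strip LaTeX residue from one brace group and split it into names."""
    text = re.sub(r"\$[^$]*\$", "", group)           # affiliation markers such as $^{a,b}$
    text = re.sub(r"\\?\b(?:small|bf)\b", "", text)  # font switches, with or without a backslash
    text = re.sub(r"[{}%]", "", text)                # braces and comment signs
    parts = re.split(r"\s+and\s+", text)             # names are joined by the word "and"
    return [p.strip() for p in parts if p.strip()]

group = "{{small Tanya Araujo$^{a,b}$ and Elsa Fontainha$^{a}$}"
names = names_from_group(group)
print(names)  # ['Tanya Araujo', 'Elsa Fontainha']
```

This only works when you already know which group holds the names. In the second author tag from the question the names sit in the {bf ...} groups instead, so the group selection would need to be adapted to each layout.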
