Python 3 library for identifying phone numbers, names, emails, and addresses



Suppose I have already obtained this text and assigned it to a variable named textToModify:

textToModify = """
abcde abcde
Title: Director, lorem company
Phone: 123.647.4555
Mobile: 123.123.1234                    E-mail: try1@umich.edu                  Assistant: my name                  Assistant Phone: 667.889.9910
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Linkedin: www.linkedin.com/in/lorem-ipsum/
Twitter: www.twitter.com/ipsum
"""

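Now I want to extract the title, name, phone numbers, LinkedIn, Twitter, and other important information from this text. Is there a library that can do this? Or do you have any ideas? Assume the format of the text is random, but the word Title will always appear next to the title itself, the word Phone will always appear next to the phone number, and so on.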

My initial thoughts:

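The nltk library won't work, because it basically assigns tags to words, and the problem is that this text is not split into words but into characters. For example, textToModify[20] returns only a single character.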

My other idea: what if I visit the links, take screenshots of them, and then use an image-to-text library in Python (if one exists) and go from there?

Thanks!

If you have the text in a variable, you can use Python's re module to match against it with regular expressions.

This SO post covers phone numbers.

This web page walks you through detecting email addresses.
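As a rough illustration of the re approach, here is a minimal sketch. The two patterns are my own assumptions, tuned to the formats in the question's sample (digits separated by dots, a simple address), not a general-purpose solution:

```python
import re

# Lines borrowed from the question's sample text.
sample = (
    "Phone: 123.647.4555\n"
    "Mobile: 123.123.1234    E-mail: try1@umich.edu    Assistant Phone: 667.889.9910"
)

# Assumed phone format: 3 digits, 3 digits, 4 digits, separated by '.', '-' or a space.
PHONE_RE = re.compile(r"\b\d{3}[.\- ]\d{3}[.\- ]\d{4}\b")
# A deliberately loose e-mail pattern; it does not cover the full RFC grammar.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

phones = PHONE_RE.findall(sample)
emails = EMAIL_RE.findall(sample)
print(phones)  # ['123.647.4555', '123.123.1234', '667.889.9910']
print(emails)  # ['try1@umich.edu']
```

Numbers written in other styles (country codes, parentheses) would need a broader pattern.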

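For names and addresses, unless they are preceded by a label such as Name: or Address:, or you can apply some logic to locate them, you will probably find this harder than you expected. This SO post gives an example of trying to match addresses.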
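When the labels are present, as the question says they are, one possible sketch is to match "Label: value" pairs directly. The label list, the `extract_labeled_fields` name, and the rule that a value ends at two or more whitespace characters or at the end of the line are all assumptions drawn from the sample text:

```python
import re

# Hypothetical label list; extend it with whatever labels your text uses.
LABELS = ["Title", "Phone", "Mobile", "E-mail", "Assistant",
          "Assistant Phone", "Linkedin", "Twitter"]

# Longer labels go first so "Assistant Phone" is tried before "Assistant".
alternation = "|".join(
    re.escape(label) for label in sorted(LABELS, key=len, reverse=True)
)

# Assumption: a value runs from the colon up to the next run of two or more
# whitespace characters, or to the end of the line.
FIELD_RE = re.compile(
    r"\b(?P<label>" + alternation + r"):[ \t]*(?P<value>.*?)(?=\s{2,}|$)",
    re.MULTILINE,
)

def extract_labeled_fields(text):
    """Return a dict mapping each lower-cased label to the value after it."""
    return {m.group("label").lower(): m.group("value")
            for m in FIELD_RE.finditer(text)}

sample = (
    "Title: Director, lorem company\n"
    "Phone: 123.647.4555                 \n"
    "Mobile: 123.123.1234                    E-mail: try1@umich.edu"
    "                  Assistant: my name                  Assistant Phone: 667.889.9910\n"
    "Linkedin: www.linkedin.com/in/lorem-ipsum/\n"
)
fields = extract_labeled_fields(sample)
print(fields["title"])            # Director, lorem company
print(fields["assistant phone"])  # 667.889.9910
```

This still does nothing for an unlabeled name such as the "abcde abcde" line. That part needs separate logic.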

Hope this helps. I thought about writing up a complete answer, but regex resources on SO and other sites are quite plentiful.

A program like this can do what you want:

finds = {}
# Split on whitespace so each word can be inspected on its own.
words = textToModify.split()
for index, word in enumerate(words):
    if word == 'Title:':
        # Store the single word that follows the label.
        finds['title'] = words[index + 1]

However, you need to write an if for every element you want to extract, and take the next two elements for things like two-word names.
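One way to avoid a separate if per field is to drive the same loop from a lookup table. This is only a sketch: the `FIELDS` mapping, the `extract_fields` name, and the word counts are hypothetical and have to be chosen to match your data.

```python
# Hypothetical mapping: label token -> (result key, number of words in the value).
FIELDS = {
    "Title:": ("title", 3),
    "Phone:": ("phone", 1),
    "Mobile:": ("mobile", 1),
    "E-mail:": ("email", 1),
    "Twitter:": ("twitter", 1),
}

def extract_fields(text, fields=FIELDS):
    """Collect the words that follow each known label token."""
    words = text.split()
    finds = {}
    for index, word in enumerate(words):
        if word in fields:
            key, count = fields[word]
            # Slicing never raises, even if the text ends early.
            finds[key] = " ".join(words[index + 1:index + 1 + count])
    return finds

sample = """Title: Director, lorem company
Phone: 123.647.4555
Mobile: 123.123.1234     E-mail: try1@umich.edu
Twitter: www.twitter.com/ipsum"""

result = extract_fields(sample)
print(result["title"])  # Director, lorem company
print(result["email"])  # try1@umich.edu
```

Note the limits of this approach: each value has a fixed word count, and a two-word label such as "Assistant Phone:" splits into "Assistant" and "Phone:", so it would overwrite the plain "Phone:" entry.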
