How to separate the last name (uppercase) and first name (lowercase) in a single string using a regular expression in Python



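I am working on this exercise:

In this case, the last name is written in capital letters (upper case) and placed before the first name.

The last name can consist of several names, separated by a space or a hyphen (-). The last name can contain a prefix with lowercase letters (Di, Mac).

Sometimes the first name and last name appear without a space between them.

A person can have more than one first name.

I am trying to split these strings into groups => group 1: last name (uppercase). Group 2: first name(s) (lowercase).

Test input: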

DiCAPRIO Leonardo Wilhelm
MacGYVER Angus
ANDERSON Richard Dean
ZETA-JONES Catherine
BONHAM CARTER Helena
DOUGLASMichael

Output (what it should look like):

["DiCAPRIO"], ["Leonardo Wilhelm"]
["MacGYVER"], ["Angus"]
["ANDERSON"], ["Richard Dean"]
["ZETA-JONES"], ["Catherine"]
["BONHAM CARTER"], ["Helena"]
["DOUGLAS"], ["Michael"]

I have a regex:

([A-Z]{2,}\s?-?[A-Z]{2,}|[A-Z]{2,})

(This regex works on https://regex101.com)

I use the function re.findall()

In Python 3.x:

for author in arrayAuthors:
    print(re.findall(r'([A-Z]{2,}\s?-?[A-Z]{2,}|[A-Z]{2,})', author))

In the Python script, it only captures a last name made up of two names and a hyphenated last name.

["ZETA-JONES"], ["Catherine"]
["BONHAM CARTER"], ["Helena"]

The other names are returned unsplit:

["DiCAPRIO Leonardo Wilhelm"]
["MacGYVER Angus"]
["ANDERSON Richard Dean"]
["DOUGLASMichael"]
import re
# Join the names so findall can be called once on a multiline string.
# "." does not match a newline, so each line is treated as its own input.
authors = '\n'.join(["DiCAPRIO Leonardo Wilhelm", "MacGYVER Angus", "ANDERSON Richard Dean", "ZETA-JONES Catherine", "BONHAM CARTER Helena", "DOUGLASMichael"])
# Match the first name; whatever comes before it is the last name.
pattern = r'(.+?)[ -]*([A-Z][a-z]+ ?[A-Z]*[a-z]*)'
# Returns a list of (last name, first name) tuples.
print(re.findall(pattern, authors))

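The first name is easy to match: a capital letter followed by a run of lowercase letters. That is why I match the first name and treat whatever comes before it as the last name.

Output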

[('DiCAPRIO', 'Leonardo Wilhelm'), 
('MacGYVER', 'Angus'),
('ANDERSON', 'Richard Dean'),
('ZETA-JONES', 'Catherine'),
('BONHAM CARTER', 'Helena'),
('DOUGLAS', 'Michael')]
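
The same idea can be applied to one string at a time. The sketch below is a variation, not the pattern from the answer above: the helper name split_name is hypothetical, and the tail of the pattern is changed (an assumption on my part) to (?: [A-Z][a-z]+)* anchored with $, so that three or more first names are captured as well. The original tail " ?[A-Z]*[a-z]*" covers at most two.

```python
import re

# Variation on the pattern above (assumption: any number of first names,
# each a capital letter followed by lowercase letters, at the end of the string).
PATTERN = re.compile(r'(.+?)[ -]*([A-Z][a-z]+(?: [A-Z][a-z]+)*)$')


def split_name(author):
    """Return (last_name, first_names), or None if the string does not fit."""
    m = PATTERN.match(author)
    return m.groups() if m else None


print(split_name("DOUGLASMichael"))              # ('DOUGLAS', 'Michael')
print(split_name("ANDERSON Richard Dean Paul"))  # ('ANDERSON', 'Richard Dean Paul')
```

Returning None for a string that does not fit makes unexpected input visible instead of silently dropping it, which re.findall on a joined string would do.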

For this rather complex example, I would combine regex with itertools.groupby:

import re
from itertools import groupby

lst = [
    'DiCAPRIO Leonardo Wilhelm',
    'MacGYVER Angus',
    'ANDERSON Richard Dean',
    'ZETA-JONES Catherine',
    'BONHAM CARTER Helena',
    'DOUGLASMichael'
]
for v in lst:
    # Insert a space between a last name and a first name that run together.
    l = re.sub(r'([A-Z])([A-Z][a-z]+)$', r'\1 \2', v).split()
    # Group consecutive words by whether they end in a lowercase letter.
    out = [' '.join(g) for _, g in groupby(l, lambda k: bool(re.search(r'[a-z]$', k)))]
    print(out)

Prints:

['DiCAPRIO', 'Leonardo Wilhelm']
['MacGYVER', 'Angus']
['ANDERSON', 'Richard Dean']
['ZETA-JONES', 'Catherine']
['BONHAM CARTER', 'Helena']
['DOUGLAS', 'Michael']
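
The groupby approach works in two steps: first a space is inserted where an all-caps last name runs straight into a first name, then consecutive words are grouped by whether they end in a lowercase letter. A minimal sketch of those two steps as a reusable function follows; the names split_by_case and ends_lowercase are hypothetical and not part of the answer above.

```python
import re
from itertools import groupby


def ends_lowercase(word):
    """True for words that end in a lowercase letter (treated as first names)."""
    return bool(re.search(r'[a-z]$', word))


def split_by_case(author):
    # Step 1: "DOUGLASMichael" -> "DOUGLAS Michael"
    spaced = re.sub(r'([A-Z])([A-Z][a-z]+)$', r'\1 \2', author)
    # Step 2: join each run of consecutive words that share the same key.
    return [' '.join(group) for _, group in groupby(spaced.split(), ends_lowercase)]


print(split_by_case("BONHAM CARTER Helena"))  # ['BONHAM CARTER', 'Helena']
print(split_by_case("DOUGLASMichael"))        # ['DOUGLAS', 'Michael']
```

Because groupby only merges adjacent items with the same key, a string containing only a last name comes back as a one-element list, so callers may want to check the length of the result.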

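For the example data, you could use 2 capture groups, assuming the first names start with an uppercase character A-Z:

((?:Di|Mac)?[A-Z]{2,}(?:[ -][A-Z]{2,})*) ?([A-Z][^\WA-Z]+(?: [A-Z][^\WA-Z]+)*)

In parts:

  • ( Capture group 1
    • (?:Di|Mac)? Optionally match Di or Mac
    • [A-Z]{2,} Match 2 or more characters A-Z
    • (?:[ -][A-Z]{2,})* Repeat 0+ times a space or - followed by 2 or more characters A-Z
  • ) ? Close group 1 and match an optional space
  • ( Capture group 2
    • [A-Z][^\WA-Z]+ Match a character A-Z, then 1+ word characters other than A-Z
    • (?: [A-Z][^\WA-Z]+)* Repeat 0+ times the previous pattern preceded by a space
  • ) Close group 2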
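
The breakdown above can also be kept next to the pattern itself with re.VERBOSE. This is only a sketch of the same pattern: the name NAME_RE is hypothetical, and literal spaces are written as [ ] because verbose mode ignores whitespace outside character classes.

```python
import re

NAME_RE = re.compile(r"""
    (                           # group 1: last name
        (?:Di|Mac)?             #   optional prefix Di or Mac
        [A-Z]{2,}               #   2 or more characters A-Z
        (?:[ -][A-Z]{2,})*      #   more all-caps parts after a space or hyphen
    )
    [ ]?                        # optional space between last and first name
    (                           # group 2: first name(s)
        [A-Z][^\WA-Z]+          #   A-Z, then 1+ word characters other than A-Z
        (?:[ ][A-Z][^\WA-Z]+)*  #   more first names, each preceded by a space
    )
""", re.VERBOSE)

print(NAME_RE.fullmatch("DOUGLASMichael").groups())  # ('DOUGLAS', 'Michael')
```

Note that [^\WA-Z] is wider than [a-z]: it also accepts digits, underscores and, for str patterns in Python 3, non-ASCII letters such as é, which may or may not be what you want for names.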


For example:

import re
arrayAuthors = [
    "DiCAPRIO Leonardo Wilhelm",
    "MacGYVER Angus",
    "ANDERSON Richard Dean",
    "ZETA-JONES Catherine",
    "BONHAM CARTER Helena",
    "DOUGLASMichael"
]
regex = r"((?:Di|Mac)?[A-Z]{2,}(?:[ -][A-Z]{2,})*) ?([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
for author in arrayAuthors:
    print(re.findall(regex, author))

Output

[('DiCAPRIO', 'Leonardo Wilhelm')]
[('MacGYVER', 'Angus')]
[('ANDERSON', 'Richard Dean')]
[('ZETA-JONES', 'Catherine')]
[('BONHAM CARTER', 'Helena')]
[('DOUGLAS', 'Michael')]
