Splitting strings that contain multiple substrings



I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split every string that contains more than one of these substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows a substring.

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the "multiple" condition in my code:

import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    if substrings in i:
        splitted.append([])
    splitted[-1].append(item)

Exception: when the last character is a period (e.g. Prof.), split after the second word following the substring.


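Update: names is more complex than I thought, and follows

  1. the title-like pattern that has already been answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'),
  2. until a second pattern of strings follows ('Mister Kelly, AWS'),
  3. until a third pattern of strings follows through to the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary').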

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

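Sometimes Secretary is followed by a varying specification. I don't care about the characters that follow Secretary up to the point where the next name starts. They can be dropped. Of course, 'Secretary' itself should be stored, as in updated_output.

I created a (hopefully exhaustive) list, specifications, of what can follow Secretary. Here is a representation of the list: specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I use the specifications list to account for the third pattern?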

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

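You want to split at the word boundary just before one of the three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of the titles. Splitting on a zero-width match like this requires Python 3.7 or later.

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']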

You can then strip the pieces and drop the empty ones (the := operator needs Python 3.8 or later):

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

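For a list of input strings, simply apply this processing to every string:

import re

def get_names(s):
    # Split at each word boundary that is immediately followed by a title
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    # Strip whitespace and drop empty pieces
    return [x for i in v if (x := i.strip())]

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
output = []
for n in names:
    output.extend(get_names(n))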

which gives:

output = ['Acquaintance Muller',
          'Vice president Johnson',
          'Affiliate Peterson',
          'Acquaintance Dr. Rose']
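The pattern above hard-codes the three titles. As a minimal sketch, assuming the titles live in the question's substrings list, the same split can be built from that list. The helper names build_splitter and split_names are made up for this illustration, and the code relies on re.split accepting zero-width matches (Python 3.7+).

```python
import re

def build_splitter(substrings):
    # Escape each title so characters such as "." are matched literally,
    # then split at any word boundary immediately followed by a title.
    alternation = "|".join(re.escape(s) for s in substrings)
    return re.compile(rf"\b(?={alternation})", flags=re.IGNORECASE)

def split_names(names, substrings):
    splitter = build_splitter(substrings)
    result = []
    for name in names:
        for part in splitter.split(name):
            part = part.strip()
            if part:  # skip the empty piece produced by a match at position 0
                result.append(part)
    return result

names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = ['Vice president', 'Affiliate', 'Acquaintance']
result = split_names(names, substrings)
print(result)
```

Because the split happens before a title rather than after a name, 'Acquaintance Dr. Rose' stays in one piece without special handling for the trailing period.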

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

# Match any of the substrings literally (case-sensitive)
r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    # Start index of every title found in the string
    starts = [i.start() for i in r.finditer(n)]
    if not starts:
        out.append(n)
        continue
    if starts[0] != 0:
        starts = [0, *starts]
    starts.append(len(n))
    # Slice between consecutive start positions
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
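Neither snippet above covers the second and third patterns from the question's update. The following is only a sketch under stated assumptions: entries in those patterns are assumed to start with one of the honorifics 'Mister', 'Miss' or 'Dr.' (a list guessed from the sample data, not given in the question), and an honorific that directly follows a title, as in 'Acquaintance Dr. Rose', is assumed not to start a new entry. Since the question says the text after Secretary can be discarded, the sketch drops any trailing specification, so 'Miss Berg, Secretary for Relations' becomes 'Miss Berg, Secretary'. The name split_entries is made up for this illustration.

```python
import re

titles = ["Vice president", "Affiliate", "Acquaintance"]
honorifics = ["Mister", "Miss", "Dr."]  # assumption, inferred from the sample data
specifications = ["", " of State", " for Relations",
                  " for the Interior", " for the Environment"]

title_alt = "|".join(map(re.escape, titles))
honorific_alt = "|".join(map(re.escape, honorifics))
# One fixed-width negative lookbehind per title: no split before an
# honorific that directly follows a title (e.g. "Acquaintance Dr. Rose").
not_after_title = "".join(f"(?<!{re.escape(t)} )" for t in titles)
splitter = re.compile(
    rf"\b(?={title_alt})|{not_after_title}\b(?={honorific_alt})"
)

# A non-empty specification at the very end of an entry, after ", Secretary".
spec_alt = "|".join(re.escape(s) for s in specifications if s)
trailing_spec = re.compile(rf"(, Secretary)(?:{spec_alt})$")

def split_entries(names):
    out = []
    for name in names:
        for part in splitter.split(name):
            part = part.strip()
            if part:
                # Keep ", Secretary" but drop the specification after it
                out.append(trailing_spec.sub(r"\1", part))
    return out

names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
         'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
         'Mister Kelly, AWS',
         'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, '
         'Secretary for Relations Dr. Jakob, Secretary']
result = split_entries(names)
print(result)
```

The matching here is case-sensitive, unlike the first answer. To keep the specifications instead of dropping them, skip the trailing_spec substitution and append the stripped part directly.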
