Split a string by a list of separators, regardless of order



I have a string text and a list names.

  • I want to split text at every occurrence of an element of names

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

Desired output:

output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

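FAQ

  • text does not always start with a names element. Thanks to Victor Lee for pointing this out. I don't care about that leading part, but others might, so thanks to everyone who answered "both cases".
  • The order of the separators in names is independent of the order in which they occur in text.
  • The separators in names are unique, but each can occur multiple times throughout text. The output will therefore contain more lists than names contains strings.
  • text will never contain the same names element twice in a row.
  • Ultimately, I want the output to be a list of lists, where each slice of the split text is paired with the separator it was split on. The order of the lists matters.

re.split() won't let me use a list as the separator argument. Can I re.compile() a list of separators?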


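Update: Thomas's code works best for my case, but I noticed a caveat that I had not been aware of before:

Some elements of names are preceded by "Mrs." or "Mr.", while only some of the corresponding matches in text are preceded by "Mrs." or "Mr.".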


So far:

import re

names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]

def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [[name, clist.rstrip()] for name, clist in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    ) if clist is not None
    ]
print(result)

[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [86], in <module>
111     salutation, *name = name_components
112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in <genexpr>(.0)
111     salutation, *name = name_components
112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in create_regex_string(name)
109 if len(name_components) == 1:
110     return re.escape(name)
--> 111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
ValueError: not enough values to unpack (expected at least 1, got 0)
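
The last line of the traceback says that name_components was empty when it was unpacked. That happens when name is an empty or whitespace-only string, because "".split() returns an empty list. The sketch below assumes that the list being iterated (mps in the traceback) contains such entries; clean_names is a hypothetical helper, not part of any answer here.

```python
import re

def create_regex_string(name: str) -> str:
    """Build a pattern for one name; a leading salutation becomes optional."""
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *rest = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(rest))}"

def clean_names(names):
    """Hypothetical helper: drop entries that are empty or whitespace only."""
    return [name for name in names if name.strip()]

# "".split() returns [], which is what makes the star-unpacking fail.
raw_names = ["Mr. Mike, ADS", "", "Monika, TFO", "   "]
usable = clean_names(raw_names)
regex_string = "|".join(create_regex_string(name) for name in usable)
print(usable)
```

Filtering before building the pattern also avoids putting an empty alternative into the regex.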

If you are looking for a way to do this with regular expressions, then:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Explanation

First, we dynamically build a regular expression, regex1, from the names argument that was passed in:

(?=Mike|Monika)

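When you split the input on this, any of the passed names may appear at the start or end of the input, so you can end up with empty strings in the result. We filter those out and get: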

['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']

We then split each string in that list on:

(Mike|Monika)

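Again, we filter out any empty strings to get the final result.

The key to all of this is that when the regex we split on contains a capture group, the text matched by that capture group is also returned as part of the result list.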
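
A small illustration of that capture-group behaviour of re.split; the shortened text and the variable names are only for this example.

```python
import re

text = "Monika goes shopping. Mike likes Pizza."

# Without a capture group the separators are dropped from the result.
without_group = re.split("Mike|Monika", text)

# With a capture group the matched separators are kept in the result.
with_group = re.split("(Mike|Monika)", text)

print(without_group)  # ['', ' goes shopping. ', ' likes Pizza.']
print(with_group)     # ['', 'Monika', ' goes shopping. ', 'Mike', ' likes Pizza.']
```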

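Update

You did not specify what should happen if the input text does not begin with one of the names. Assuming you want to ignore everything up to the first name, see the following version. Likewise, if the text contains none of the names, the updated code simply returns an empty list: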

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    # keep only the part of the text starting at the first name
    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    text = m.group(0)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

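Instead of using a regular expression, you can also rebuild the text into a suitable format so that the split method gives the expected result, and then add a few string formatting steps.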

# works on Python 2 or Python 3; time complexity is O(n*m) for n words and m names
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt
    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]
    tmp_text_arr = tmp_text.split(my_sprt)
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)
    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

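This code also works with text that does not start with an element of names.

Key point: reformat the text so that it uses a custom separator (such as |), for example |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me. The separator must not appear in the original text.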
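
Since the approach only works when the custom separator is absent from the original text, one possible guard is to pick the separator at run time. This is a minimal sketch; pick_separator and its candidate list are hypothetical and not part of the answer above.

```python
def pick_separator(text, candidates=("|", "\x00", "\x01")):
    """Hypothetical helper: return the first candidate that is absent from text."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("every candidate separator already occurs in the text")

# "|" already occurs in this text, so the next candidate is chosen.
text = "Monika goes shopping | Mike likes Pizza."
separator = pick_separator(text)
print(repr(separator))  # '\x00'
```

The chosen value could then be used in place of the hard-coded my_sprt.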

I took one of the solutions given here and refactored it slightly.

def split(txt, seps, actual_sep='\1'):
    # names in order of appearance (only whole, whitespace-delimited words)
    order = [item for item in txt.split() if item in seps]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list(zip(order, [i.strip() for i in txt.split(actual_sep) if bool(i.strip())]))

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(split(text, names))

Edit

Another solution that handles some of the edge cases mentioned here.

def split(txt, seps, sep_pack='\1'):
    # wrap every name in the marker character so it becomes its own fragment
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")

    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                temp.append([lst[idx], lst[idx+1]])
                idx += 2
            else:
                temp.append(['', lst[idx]])
                idx += 1
    return temp

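It is a bit ugly, though; I hope to improve it.

This is similar to some of the answers here, but simpler.

There are three steps:

  1. Find all matches of the separators
  2. Split out the remaining text
  3. Merge the results of (1) and (2) into a list of lists, as required

We could combine (1) and (2), but that makes building the list of lists more complicated.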

import re

def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))
    # step 1: find the separators (in order)
    separator = pattern.findall(text)
    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))
    # at this point, if `remainder` is longer, it's because `text`
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]
    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
split_on_names(names, text)
# output:
#
# [
#    ['', 'Hi '],
#    ['Monika', ' goes shopping. Then she rides bike. '],
#    ['Mike', ' likes Pizza. '],
#    ['Monika', ' hates me.']
# ]
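
For comparison, here is one way steps (1) and (2) might be combined, using re.finditer and slicing the text between consecutive matches. This is only a sketch of that alternative; split_on_names_finditer is a hypothetical name, and unlike the version above it keeps a name that is followed by no text, paired with an empty string.

```python
import re

def split_on_names_finditer(names, text):
    """Pair each name occurrence with the text up to the next occurrence."""
    pattern = re.compile("|".join(map(re.escape, names)))
    matches = list(pattern.finditer(text))
    result = []
    # Text before the first name, if any, is paired with an empty separator.
    prefix_end = matches[0].start() if matches else len(text)
    if text[:prefix_end]:
        result.append(["", text[:prefix_end]])
    # Walk over each match together with the one that follows it.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(text)
        result.append([current.group(), text[current.end():end]])
    return result

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
pairs = split_on_names_finditer(names, text)
print(pairs)
```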

You can use re.split together with zip:

import re
from pprint import pprint

text = "Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]
pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Mike', ' likes Pizza.'],
 ['Monika', ' hates me.']]

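Notes:

  • This is an answer to revision 9 of the question.

    • To account for the changes in revision 11 of the question, there is an update at the end of this answer.
  • You do not specify whether the "text" before the first occurrence of a name should be considered.

    • The script above ignores the "text" before the first occurrence.
  • You also do not specify what should happen if the text ends with a name.

    • The script above includes that occurrence, paired with an empty string. If the empty "text" is unwanted, this is easily solved by removing the last element.
  • zip works because, after that cleanup, fragments always has an even number of elements. If the first element does not match a name (it is either text or an empty string), we remove it; if the text ends with a name, the last element is always an empty string.

re.split

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]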
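
The quoted behaviour can be checked with a few short inputs. The strings below are made up for this illustration; in each case re.split returns an odd number of elements, which is why the list has an even length once the first element is dropped.

```python
import re

pattern = "(Mike|Monika)"

# A match at the start produces a leading empty string.
starts_with_name = re.split(pattern, "Monika hates me.")
# A match at the end produces a trailing empty string.
ends_with_name = re.split(pattern, "I like Mike")
# No match at all returns the whole text as a single element.
no_name = re.split(pattern, "Nothing here.")

print(starts_with_name)  # ['', 'Monika', ' hates me.']
print(ends_with_name)    # ['I like ', 'Mike', '']
print(no_name)           # ['Nothing here.']
```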


Here is the same example, but without ignoring the "text" before the first occurrence:

import re

text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)
if fragments:
    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments
    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]
# # remove empty text
# if result and not result[-1][1]:
#     result = result[:-1]
print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]


Update for revision 11 of the question

An attempt to incorporate id345678's additional requirement:

import re
from pprint import pprint

def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    # expects exactly "<salutation> <name>"
    salutation, name_part = name_components
    return f"({re.escape(salutation)} )?{re.escape(name_part)}"

text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."
names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)
# one group per optional salutation, plus the outer group added below
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [
        [name, text.rstrip()]
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]
pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Dr. Mike', ' likes Pizza.'],
 ['Mrs. Monika', ' hates me.'],
 ['Henry', ' needs a break.']]

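Notes:

  • The final regex string then looks like (Henry|(Dr\. )?Mike|(Mrs\. )?Monika)

    • E.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika
    • It also works for other salutations (as long as a single space separates the salutation from the name)
  • Because we introduced additional groups into the regex, fragments has more values

    • Therefore, we need to change the lines with zip so that they are dynamic
  • If you don't want the salutation in result, you can use name.split()[-1] when creating result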

result = [
    [name.split()[-1], text.rstrip()]
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]
# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]
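
As a possible alternative to counting groups, the optional salutation can be written as a non-capturing group, (?:...). Only the outer group then contributes to the output of re.split, so fragments simply alternates between text and name. This is a sketch of that variation, not the code from the answer above.

```python
import re

def create_regex_string(name: str) -> str:
    """Like the function above, but the optional salutation is non-capturing."""
    components = name.split()
    if len(components) == 1:
        return re.escape(name)
    salutation, *rest = components
    return f"(?:{re.escape(salutation)} )?{re.escape(' '.join(rest))}"

text = ("Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. "
        "Mrs. Monika hates me. Henry needs a break.")
names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)
# Only the outer group captures, so fragments alternates text, name, text, ...
fragments = re.split(f"({regex_string})", text)
# fragments[0] is whatever precedes the first name; it is skipped here.
result = [[name, part.rstrip()] for name, part in zip(fragments[1::2], fragments[2::2])]
print(result)
```

With this pattern the slicing no longer depends on how many names carry a salutation.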

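Please note: I have not tested all use cases, because I updated the script during my break. If there are problems, let me know and I will look into them after work.

Your example does not exactly match the desired output. It is also unclear whether the example input always has this structure, e.g. a period at the end of every sentence.

That said, you might want to try this quick-and-dirty approach: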

import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'
names = ['Ruth', 'Mike', 'Monika']

rsplit = re.compile("|".join(sorted(names))).split
output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])
print(output)

This outputs:

[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]

This one works without re, unless you explicitly need to use it. It works for the given test case.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]
    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)

sep(text, names)

Gives the output:

[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]

It should work for other cases as well.
