How to return new strings with the different possibilities if the original string contains a forward slash



Here is an example of what I want to achieve:

Input:

example = "bear a/little/no resemblance to sth/sb/whatever"

Output:

alternatives = ['bear a resemblance to sth',
'bear a resemblance to sb',
'bear a resemblance to whatever',
"bear little resemblance to sth",
"bear little resemblance to sb",
"bear little resemblance to whatever",
"bear no resemblance to sth",
"bear no resemblance to sb",
"bear no resemblance to whatever",
]

Here is another example:

Input:

example = "beat about/around the bush"

Output:

alternatives = ['beat about the bush',
'beat around the bush'
]

And one more:

Input:

example = "become available/rich/a writer, etc."

Output:

alternatives = ['become available',
'become rich',
'become a writer']

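I'm given an English sentence that may contain forward slashes, where a forward slash means OR. So if I find a forward slash in the example string, I need to return new strings, one for each of the words on either side of the /.

A sentence may contain any number of / characters, or none at all.

Edit

I achieved the expected result with the code below, but I feel it's far from Pythonic. I'd appreciate it if someone could suggest a more Pythonic way to solve this.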

import re

alt = []  # short for alternatives
# multiple cleaning stages for every baseword
##1## remove ', etc.'
a = example.replace(', etc.', '')
##2## does this string have / / in it
regex = re.compile(r'(\w+)/(\w+)/(\w+)')
match = regex.search(a)
delete_later = []  # a list with sentences to delete later from alt as it cleans up old used sentences
if match:
    part = a.partition(match.group(0))
    s1 = part[0] + match.group(1) + part[2]
    s2 = part[0] + match.group(2) + part[2]
    s3 = part[0] + match.group(3) + part[2]
    alt.append(s1)
    alt.append(s2)
    alt.append(s3)
    # check again:
    for _ in range(10):
        for item in alt:
            regex = re.compile(r'(\w+)/(\w+)/(\w+)')
            match = regex.search(item)
            if match:
                delete_later.append(item)
                part = item.partition(match.group(0))
                s1 = part[0] + match.group(1) + part[2]
                s2 = part[0] + match.group(2) + part[2]
                s3 = part[0] + match.group(3) + part[2]
                alt.append(s1)
                alt.append(s2)
                alt.append(s3)
        # clean up
        for i in delete_later:
            try:
                # avoid Traceback: ValueError: list.remove(x): x not in list
                alt.remove(i)
            except ValueError:
                pass

##3## does this string have / in it
if len(alt) > 0:
    for _ in range(10):
        for item in alt:
            regex = re.compile(r'(\w+)/(\w+)')
            match = regex.search(item)
            if match:
                delete_later.append(item)
                part = item.partition(match.group(0))
                s1 = part[0] + match.group(1) + part[2]
                s2 = part[0] + match.group(2) + part[2]
                alt.append(s1)
                alt.append(s2)
        # clean up
        for i in delete_later:
            try:
                # avoid Traceback: ValueError: list.remove(x): x not in list
                alt.remove(i)
            except ValueError:
                pass
#else:
# check for the 1st time
regex = re.compile(r'(\w+)/(\w+)')
match = regex.search(a)
delete_later = []  # a list with sentences to delete later from alt as it cleans up old used sentences
if match:
    part = a.partition(match.group(0))
    s1 = part[0] + match.group(1) + part[2]
    s2 = part[0] + match.group(2) + part[2]
    alt.append(s1)
    alt.append(s2)
    # check again:
    for _ in range(10):
        for item in alt:
            regex = re.compile(r'(\w+)/(\w+)')
            match = regex.search(item)
            if match:
                delete_later.append(item)
                part = item.partition(match.group(0))
                s1 = part[0] + match.group(1) + part[2]
                s2 = part[0] + match.group(2) + part[2]
                alt.append(s1)
                alt.append(s2)
        # clean up
        for i in delete_later:
            try:
                # avoid Traceback: ValueError: list.remove(x): x not in list
                alt.remove(i)
            except ValueError:
                pass

for i, e in enumerate(alt, 1):
    print(i, e)

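You can capture your cases with this regexp: "(\w+/)+(.+?, etc.)|(\w+/)+\w+":

  • (\w+/)+ is the first part of the options string, where each option ends with /
  • the ending is covered by two different cases, (.+?, etc.) or \w+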
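To see what the two branches of the pattern pick out, here is a small sketch. The names PATTERN and first_group are only illustrative and are not part of the answer's code.

```python
import re

# The pattern described above, with the backslashes written out.
PATTERN = re.compile(r"(\w+/)+(.+?, etc.)|(\w+/)+\w+")

def first_group(s):
    """Return the first slash-separated option group found in s, or None."""
    m = PATTERN.search(s)
    return m.group() if m else None

samples = [
    "beat about/around the bush",            # handled by the (\w+/)+\w+ branch
    "become available/rich/a writer, etc.",  # handled by the ', etc.' branch
    "no slash here",                         # nothing to capture
]
for s in samples:
    print(repr(s), "->", repr(first_group(s)))
```

The second sample shows why the ", etc." branch is needed: its last option, "a writer", contains a space, so \w+ alone would stop at "a".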

Full code:

import re
from pprint import pprint

def get_options(s):
    # removal of custom delimiters such as "etc" and splitting
    return s.replace(", etc.", "").split("/")

def split(s):
    # expand the first group of options found in s, if any
    result = re.search(r"(\w+/)+(.+?, etc.)|(\w+/)+\w+", s)
    if result:
        result = result.group()
        return [s.replace(result, r) for r in get_options(result)]
    else:
        return [s]

examples = [
    "bear a/little/no resemblance to sth/sb/whatever",
    "beat about/around the bush",
    "become available/rich/a writer, etc.",
    "(the) most attractive/important/popular, etc. a dance/language/riding, etc. school",
]
# keep expanding until a pass no longer produces new strings
n = 0
while len(examples) > n:
    n = len(examples)
    result = []
    for s in examples:
        result.extend(split(s))
    examples = result
pprint(examples)
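As a possible variation, the repeated passes can be replaced by a single pass that collects the choices for every group and combines them with itertools.product. This is only a sketch that reuses the same pattern; the function name expand is an assumption for illustration, not part of the code above.

```python
import re
from itertools import product

# Same pattern as in the code above.
PATTERN = re.compile(r"(\w+/)+(.+?, etc.)|(\w+/)+\w+")

def expand(sentence):
    """Return every alternative encoded by the slash groups in sentence."""
    parts = []  # one list of choices per segment of the sentence
    pos = 0
    for m in PATTERN.finditer(sentence):
        # literal text before the group has exactly one "choice"
        parts.append([sentence[pos:m.start()]])
        # the group itself contributes one choice per option
        parts.append(m.group().replace(", etc.", "").split("/"))
        pos = m.end()
    parts.append([sentence[pos:]])  # trailing literal text
    # product() yields one combination per way of picking a choice from each part
    return ["".join(combo) for combo in product(*parts)]

for line in expand("bear a/little/no resemblance to sth/sb/whatever"):
    print(line)
```

Unlike str.replace in the code above, which substitutes every occurrence of a matched group at once, this version treats each matched group independently, so a group that appears twice in a sentence is expanded twice.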
