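Remove all instances of "Page + some_number" from a string

I have strings that contain page numbers in the form "Page 2". I want to remove these page numbers.

A string might be:

"The first is Page 10 and then Page 1 and then Page 12"

Current implementation:

Is there a more elegant way to remove every "Page #{some_number}" than the following?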

page_numbers = [
    'Page 1',
    'Page 2',
    'Page 3',
    'Page 4',
    'Page 5',
    'Page 6',
    'Page 7',
    'Page 8',
    'Page 9',
    'Page 10',
    'Page 11',
    'Page 12']
x = "The first is Page 10 and then Page 1 and then Page 12"
# Replace each known page label in turn with a single space
for v in page_numbers:
    x = x.replace(v, ' ')
print(x)

This should do it, using the re module:

>>> import re
>>> x = "The first is Page 10 and then Page 1 and then Page 12"
>>> re.sub(r'(\s?Page \d{1,3})', ' ', x)
'The first is  and then  and then '

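re.sub replaces every match of the regular expression in x (the third argument) with the replacement string (the second argument).

So what is the regular expression doing?

\s? just eats the space before the "Page n" text, if there is one
Page matches the string "Page " exactly (including the space)
\d{1,3} matches 1 to 3 digits. If you only need to handle up to 99, use \d{1,2}. If you need more, just adjust it.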
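
To see why the pattern handles "Page 10" as a whole, here is a minimal sketch that contrasts it with sequential str.replace calls. The variable names (text, naive, cleaned) are only illustrative, and the shortened label list is an assumption made to keep the example brief.

```python
import re

text = "The first is Page 10 and then Page 1 and then Page 12"

# Sequential replacement: 'Page 1' is also a prefix of 'Page 10' and
# 'Page 12', so replacing it first leaves stray digits behind.
naive = text
for label in ["Page 1", "Page 10", "Page 12"]:
    naive = naive.replace(label, " ")

# Regex replacement: \d{1,3} is greedy, so each page number is
# consumed in full before the match ends.
pattern = re.compile(r"\s?Page \d{1,3}")
matches = pattern.findall(text)
cleaned = pattern.sub(" ", text)

print(matches)
print(repr(naive))
print(repr(cleaned))
```

The findall call is there only to inspect what the pattern matches; note that each match includes the leading space picked up by \s?.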

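The re.sub answer is correct but incomplete. If you only want to remove certain page numbers, a plain re.sub call is not enough. You need to supply a callback to make it work.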

p_set = set(page_numbers)
def replace(m):
    p = m.group()
    # Drop the match only if it is one of the listed page labels
    return ' ' if p in p_set else p

Now pass replace as a callback to re.sub:

>>> re.sub(r'Page \d+', replace, x)
'The first is   and then   and then  '

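The second argument of re.sub accepts a callback, which is called each time a match is found. The corresponding match object is passed to replace, which should return the replacement value.

I also convert page_numbers to a set. This gives me constant-time lookups against p_set when deciding whether to keep or discard a matched string.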
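
As a small illustration of selective removal, the sketch below removes only a subset of page labels and leaves the rest in place. The set pages_to_remove and the function name drop_selected are hypothetical, chosen just for this example.

```python
import re

# Hypothetical subset: only these labels should disappear
pages_to_remove = {"Page 1", "Page 12"}

def drop_selected(match):
    """Return an empty string for listed labels, the original text otherwise."""
    label = match.group()
    return "" if label in pages_to_remove else label

text = "The first is Page 10 and then Page 1 and then Page 12"
result = re.sub(r"Page \d+", drop_selected, text)
print(repr(result))
```

Because \d+ is greedy, "Page 10" is matched as a whole and looked up as "Page 10", so it is not confused with "Page 1".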

For more flexibility, you can support removing page numbers that fall within a range:

def replace(m):
    # group(1) holds just the digits captured by (\d+)
    return ' ' if int(m.group(1)) in range(1, 13) else m.group()

And call it accordingly:

>>> re.sub(r'Page (\d+)', replace, x)
'The first is   and then   and then  '

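This is more efficient than maintaining a list or set of page numbers, assuming the range you want to remove is contiguous. Another thing to note is that, in Python 3, a membership test with the in operator on a range object is computationally cheap (constant time for integers).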
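
One possible way to avoid hard-coding the bounds is to build the callback from a small factory function. This is only a sketch: make_range_replacer and its parameters first and last are hypothetical names, and the bounds are treated as inclusive by assumption.

```python
import re

def make_range_replacer(first, last):
    """Build a re.sub callback that blanks pages in [first, last]."""
    pages = range(first, last + 1)  # membership test does not iterate

    def replace(match):
        return " " if int(match.group(1)) in pages else match.group()

    return replace

text = "The first is Page 10 and then Page 1 and then Page 12"
result = re.sub(r"Page (\d+)", make_range_replacer(1, 10), text)
print(repr(result))

# The range membership check is computed arithmetically, so even a
# very large range answers immediately.
big_check = 10**12 in range(1, 10**15)
print(big_check)
```

Here pages 1 through 10 are removed and "Page 12" is kept, since 12 lies outside the chosen range.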

You can do this with a regular expression like so:

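import re
x = "The first is Page 10 and then Page 1 and then Page 12"
print(re.sub(r'Page \d+', '', x))

This finds every "Page" followed by a space and any number of digits, and replaces it with nothing.

If you want to keep the spacing between words consistent, do this:

re.sub(r'Page\s\d+\s', '', x)

This also matches the trailing whitespace and removes it, because otherwise you would be left with 2 spaces (one from before "Page" and one from after it).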
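
One caveat worth checking: a pattern that requires a whitespace character after the digits will not match a page number at the very end of the string. The sketch below shows that behaviour and one possible alternative that consumes the whitespace before "Page" instead; the alternative pattern and the final strip() are assumptions about the desired result, not part of the answer above.

```python
import re

text = "The first is Page 10 and then Page 1 and then Page 12"

# Requires trailing whitespace, so the final 'Page 12' is left in place
strict = re.sub(r"Page\s\d+\s", "", text)

# Alternative: eat any whitespace *before* the label, use word
# boundaries around it, then trim what is left at the edges.
cleaned = re.sub(r"\s*\bPage\s\d+\b", "", text).strip()

print(repr(strict))
print(repr(cleaned))
```

The word boundaries keep the pattern from matching inside longer words such as "Homepage 3", which may or may not matter for your input.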
