Regex.search with a group is not capturing the group



I am trying to search the following list

/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/

using this code:

next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

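and I get the following error: AttributeError: 'str' object has no attribute 'group'.

I thought the parentheses around \d+ would group the one or more digits. My goal is to get the number that comes just before "_p/" at the end of the string.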

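You are filtering the original list, so what comes back are the original strings, not match objects. If you want match objects, map the search over the list and then filter out the None results. For example: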

next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m: m is not None, map(next_page.search, href_search))
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())

Output:

/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/

If you only want the numeric part of the match, use match.group(1) instead of match.group().
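To make the difference between the two calls concrete, here is a minimal sketch. It assumes href_search is a list of strings; the shortened paths below are made up for illustration and only keep the "/<n>_p/" ending from the question.

```python
import re

# Hypothetical shortened sample; the last entry has no page suffix.
href_search = ["/for_sale/9_zm/2_p/", "/for_sale/9_zm/3_p/", "/for_sale/9_zm/"]

next_page = re.compile(r"/(\d+)_p/$")

whole = []   # group() is the same as group(0): the entire matched text
digits = []  # group(1): only the text captured by the first parentheses
for href in href_search:
    m = next_page.search(href)
    if m is not None:  # search returns None when nothing matches
        whole.append(m.group())
        digits.append(m.group(1))

print(whole)
print(digits)
```

The first list holds the full "/<n>_p/" fragments, the second holds just the page numbers as strings.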

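I think re.findall should do it, provided href_search is a single multi-line string and next_page is compiled with re.M (otherwise $ only matches at the very end of the text):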

next_page.findall(href_search)  # ['2', '3', '6', '7', '8', '2']

Alternatively, you can split the text into lines and search each one separately:

matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches  # ['2', '3', '6', '7', '8', '2']

You could try this:

import re
# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)

It gives:

['2', '3', '6', '7', '8', '2']
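As a small illustration of why the flag matters, the sketch below runs the same pattern with and without re.M. The two-line string is a made-up, shortened stand-in for the data in the question.

```python
import re

# Hypothetical two-line sample in the same shape as the question's data.
text = "/a/9_zm/2_p/\n/a/9_zm/3_p/"

pattern = r"/(\d+)_p/$"

# Without re.M, $ matches only at the end of the whole string
# (or just before a single trailing newline).
without_flag = re.findall(pattern, text)

# With re.M, $ also matches just before every newline.
with_flag = re.findall(pattern, text, re.M)

print(without_flag)
print(with_flag)
```

Without the flag only the last line contributes a result; with it, each line does.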

The filter function only removes the lines that do not match the regex, and it returns the strings themselves, for example:

>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']

If you want the match objects, a list comprehension will do:

>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example]  # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None]  # Filter None values
['45', '123']

You can use the regex (?<=/)\d+(?=_p/$). See regex101 for an example.

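Explanation:

(?<=/): lookbehind that requires a preceding /

\d+: matches one or more digits

(?=_p/$): lookahead that requires _p/ at the end of the string

If it matches, only the \d+ part is returned.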
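A short sketch comparing the two styles may help here. The path is a hypothetical shortened version of the ones in the question. With lookarounds the surrounding "/" and "_p/" are checked but not consumed, so the whole match is already just the digits.

```python
import re

href = "/for_sale/9_zm/7_p/"  # hypothetical shortened path

# Capture-group version: the full match includes the slashes and "_p/".
with_group = re.search(r"/(\d+)_p/$", href)

# Lookaround version: the context is asserted but not part of the match.
with_lookaround = re.search(r"(?<=/)\d+(?=_p/$)", href)

print(with_group.group())       # full match of the capture-group pattern
print(with_group.group(1))      # first capture group only
print(with_lookaround.group())  # full match of the lookaround pattern
```

Either approach gets at the number; the lookaround form simply avoids having to ask for group(1).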

You can write the code to grab all the data at once, or iterate line by line to get what you need.

Here is the code for both:

text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
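import re

# Line by line: $ anchors at the end of each individual line
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=/)\d+(?=_p/$)', txt)
    print(t)

# All at once: no $ so that every line can match
t = re.findall(r'(?<=/)\d+(?=_p/)', text_line)
print(t)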

The first part works line by line; the second gets the result in one go.

The output of each is:

Line by line:

['2']
['3']
['6']
['7']
['8']
['2']

All at once:

['2', '3', '6', '7', '8', '2']

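For the second one I left out the $ sign because we need to pick up the match on every line; without re.M, $ would only match at the very end of the text.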
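One caveat with dropping the anchor, sketched below under the assumption that a path could contain a "/<n>_p/" segment somewhere other than the end (the sample text is made up to show this): the unanchored pattern will pick up those segments too, whereas keeping $ together with re.M restricts matches to line endings.

```python
import re

# Hypothetical paths: the first line also has "/5_p/" in the middle.
text = "/x/5_p/y/2_p/\n/x/9_zm/3_p/"

# No anchor: every "/<digits>_p/" segment is found, wherever it sits.
unanchored = re.findall(r"(?<=/)\d+(?=_p/)", text)

# Anchored per line: only segments at the end of a line are found.
anchored = re.findall(r"(?<=/)\d+(?=_p/$)", text, re.M)

print(unanchored)
print(anchored)
```

For the data in the question both variants give the same result, since "_p/" only ever appears at the end of each path.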
