>I have this expression:
<a class="a-link-normal" href="https://www.amazon.it/Philips-GC8735-PerfectCare-Generatore-Vapore/dp/B01J5FGW66/ref=gbph_img_s-3_7347_c3de3e94?smid=A11IL2PNWYJU7H&pf_rd_p=82ae57d3-a26a-4d56-b221-3155eb797347&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=MDQJBKEMGBX38XMPSHXB" id="dealImage"></a>
I need to find the 10 characters next to "/dp/" (B01J5FGW66).
How can I write a function that does this?
Use a regular expression:
import re
s = '<a class="a-link-normal" href="https://www.amazon.it/Philips-GC8735-PerfectCare-Generatore-Vapore/dp/B01J5FGW66/ref=gbph_img_s-3_7347_c3de3e94?smid=A11IL2PNWYJU7H&pf_rd_p=82ae57d3-a26a-4d56-b221-3155eb797347&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=MDQJBKEMGBX38XMPSHXB" id="dealImage"></a>'
print(re.search(r"dp/([A-Za-z0-9]{10})/", s)[1])
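Output: B01J5FGW66
Explanation:
Start at "dp/":
dp/
The capturing group is delimited by ( and ), and matches exactly 10 (the {10}) lowercase letters (a-z), uppercase letters (A-Z), and digits (0-9):
([A-Za-z0-9]{10})
End at "/":
/
Using re.search, we can search your string s for this expression and use [1] to access the result of the first capturing group.
Note that you may want to add extra code in case no match is found: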
m = re.search(r"dp/([A-Za-z0-9]{10})/", s)
if m is not None:
    print(m[1])
else:
    # re.search returns None when nothing is found
    print("No match")
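Since the question asks for a function, the check above could be wrapped up roughly as follows. This is only a sketch: the names DP_PATTERN and extract_dp_code are made up for illustration, and the shortened URLs in the usage lines are hypothetical stand-ins for the full link.

```python
import re

# Same pattern as above, compiled once so the function can reuse it.
DP_PATTERN = re.compile(r"dp/([A-Za-z0-9]{10})/")


def extract_dp_code(text):
    """Return the 10-character code that follows "dp/", or None if absent."""
    m = DP_PATTERN.search(text)
    # re.search returns None when the pattern is not found
    return m[1] if m is not None else None


# Hypothetical shortened URLs, used only to exercise both branches.
print(extract_dp_code("https://www.amazon.it/x/dp/B01J5FGW66/ref=abc"))
print(extract_dp_code("https://www.amazon.it/no-code-here"))
```

Returning None instead of printing leaves it up to the caller to decide what to do when nothing matches.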
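I'm assuming you always just want whatever sits between the slashes right after dp (the next route segment), and that it being 10 characters is somewhat incidental. It's a bit clunky, but this works: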
>>> x = '<a class="a-link-normal" href="https://www.amazon.it/Philips-GC8735-PerfectCare-Generatore-Vapore/dp/B01J5FGW66/ref=gbph_img_s-3_7347_c3de3e94?smid=A11IL2PNWYJU7H&pf_rd_p=82ae57d3-a26a-4d56-b221-3155eb797347&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=MDQJBKEMGBX38XMPSHXB" id="dealImage"></a>'
>>> splits = x.split("/")
>>> dp_index = splits.index('dp')
>>> result = splits[dp_index+1] # Get the next one over
>>> result
'B01J5FGW66'
To put this into a function, you can do this:
def get_route_next_to_dp(html_str):
    splits = html_str.split("/")
    dp_index = splits.index('dp')
    result = splits[dp_index+1] # Get the next one over
    return result
Usage might look like this:
html_str = '<a class="a-link-normal" href="https://www.amazon.it/Philips-GC8735-PerfectCare-Generatore-Vapore/dp/B01J5FGW66/ref=gbph_img_s-3_7347_c3de3e94?smid=A11IL2PNWYJU7H&pf_rd_p=82ae57d3-a26a-4d56-b221-3155eb797347&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=MDQJBKEMGBX38XMPSHXB" id="dealImage"></a>'
route_next_to_dp = get_route_next_to_dp(html_str)
print(route_next_to_dp)
Output:
B01J5FGW66
As desired.
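One caveat with the split approach: list.index raises ValueError when there is no 'dp' segment, and the lookup raises IndexError when 'dp' is the last segment. A minimal sketch of a more forgiving variant is below; the name get_route_next_to_dp_safe and the default parameter are assumptions added for illustration, not part of the answer above.

```python
def get_route_next_to_dp_safe(html_str, default=None):
    """Return the segment after 'dp', or `default` if there is none."""
    splits = html_str.split("/")
    try:
        dp_index = splits.index("dp")  # ValueError if 'dp' is missing
        return splits[dp_index + 1]    # IndexError if 'dp' is the last segment
    except (ValueError, IndexError):
        return default


# Hypothetical shortened inputs, used only to show each case.
print(get_route_next_to_dp_safe("https://www.amazon.it/x/dp/B01J5FGW66/ref=abc"))
print(get_route_next_to_dp_safe("https://www.amazon.it/no-code-here"))
print(get_route_next_to_dp_safe("https://www.amazon.it/x/dp"))
```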
Try this: it basically uses a regular expression to capture the 10 characters that follow "dp/" and checks whether a match was found.
import re
my_string='<a class="a-link-normal" href="https://www.amazon.it/Philips-GC8735-PerfectCare-Generatore-Vapore/dp/B01J5FGW66/ref=gbph_img_s-3_7347_c3de3e94?smid=A11IL2PNWYJU7H&pf_rd_p=82ae57d3-a26a-4d56-b221-3155eb797347&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=MDQJBKEMGBX38XMPSHXB" id="dealImage"></a>'
m = re.search(r"dp/([A-Za-z0-9]{10})/", my_string)
# m is None when there is no match, so test m itself before calling group()
if m:
    print(m.group(1))
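The pattern used here requires a "/" right after the 10 characters, so it would miss a link where the code is followed by "?" or by the end of the string. If that case matters, one possible variation is a negative lookahead in place of the trailing slash. This is a sketch under that assumption; the name find_dp_code and the shortened URLs are hypothetical.

```python
import re

# (?![A-Za-z0-9]) means "not followed by another letter or digit", so the
# code may be followed by "/", "?", a quote, or the end of the string.
RELAXED_PATTERN = re.compile(r"/dp/([A-Za-z0-9]{10})(?![A-Za-z0-9])")


def find_dp_code(text):
    """Return the 10-character code after "/dp/", or None if not found."""
    m = RELAXED_PATTERN.search(text)
    return m.group(1) if m else None


# Hypothetical shortened URLs covering the cases described above.
print(find_dp_code("https://www.amazon.it/x/dp/B01J5FGW66/ref=abc"))
print(find_dp_code("https://www.amazon.it/x/dp/B01J5FGW66?smid=1"))
print(find_dp_code("https://www.amazon.it/x/dp/B01J5FGW66X/ref=abc"))
```

The last call has 11 alphanumeric characters after "/dp/", so the lookahead rejects it rather than returning the first 10.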