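I am trying to create a regular expression to clean up Amazon URLs, but I cannot get rid of the middle part.
In the attached example, I would like "group 2" to disappear from the final result. Is that possible?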
I am using this regex: ^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(\/[dg]p\/)([^\/]+)
I would like to get results like these:
https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1 --> https://www.amazon.com/dp/B07P4LVZNL
https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2 --> https://www.amazon.com/dp/B07DXPN7TK
https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2 --> https://www.amazon.com/gp/B07R23QGH6
https://www.amazon.it/dp/B07R23QGH6/ --> https://www.amazon.it/dp/B07R23QGH6/
https://regex101.com/r/AFGk96/1
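You have over-escaped the pattern. In Python, the slash has no special meaning in a regular expression and does not need to be escaped, so:
^(?:http:\/\/|www\.|https:\/\/)([^\/]+)(\s?.*)(\/[dg]p\/)([^\/]+)
can be written (with a few other simplifications) as: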
^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+)
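To see what the two remaining groups capture, here is a minimal sketch that runs the simplified pattern with re.match. The URL is a shortened version of one of the examples in the question, and the variable names host and product_path are only illustrative.

```python
import re

# Simplified pattern from above, without a trailing .*
pattern = re.compile(r'^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+)')

# Shortened sample URL (query string trimmed for readability)
url = 'https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1'

match = pattern.match(url)
host, product_path = match.groups()

print(host)          # www.amazon.com
print(product_path)  # /dp/B07DXPN7TK
```

The lazy .*? between the groups is what swallows the product-name segment, so it never ends up in a capture group.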
Once we add .* to the end to match the rest of the string, we end up with something that works:
import re
# Group 1 is the host, group 2 is the /dp/<id> or /gp/<id> part
amazon_url_pattern = re.compile(r'^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+).*')
url = 'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1'
# The scheme sits in a non-capturing group, so it is written back explicitly
result = amazon_url_pattern.sub(r'https://\1\2/', url)
print(result)
This prints:
https://www.amazon.com/dp/B07P4LVZNL/
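As a quick check against all four URLs from the question, the same substitution can be wrapped in a small helper. This is only a sketch: the name clean_amazon_url is made up for illustration, and it assumes the cleaned URL should always use https and end with a slash.

```python
import re

AMAZON_URL = re.compile(r'^(?:https?://)?(www[^/]+).*?(/[dg]p/[^/]+).*')

def clean_amazon_url(url):
    """Return the shortened product URL; input that does not match is returned unchanged."""
    # Assumption: always emit https, since the scheme is not captured by the pattern.
    return AMAZON_URL.sub(r'https://\1\2/', url)

urls = [
    'https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1',
    'https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2',
    'https://www.amazon.es/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2',
    'https://www.amazon.it/dp/B07R23QGH6/',
]

for url in urls:
    print(clean_amazon_url(url))
```

Note that the host is kept as it appears in the input, so the amazon.es URL stays on amazon.es.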