I have links like this:
<div class="zg_title">
<a href="https://rads.stackoverflow.com/amzn/click/com/B000O3GCFU" rel="nofollow noreferrer">Thermos Foogo Leak-Proof Stainless St...</a>
</div>
I'm scraping them like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value
The problem is that this returns the whole URL, and I only want the ID:
B000O3GCFU
I think I need to do something like this:
product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]
What is the simplest regex I could use in this case?
EDIT:
Strangely, the link URL above is not displayed in full. The actual URL is:
http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
Use /\w+$/:
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
/\w+$/ matches the run of letters, digits, and underscores at the end of the string.
require 'nokogiri'
s = <<EOF
<div class="zg_title">
<a href="http://rads.stackoverflow.com/amzn/click/B000O3GCFU">Thermos Foogo Leak-Proof Stainless St...</a>
</div>
EOF
doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"
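One caveat: this pattern only works when the ID is the very last thing in the string. A quick check against the two URL shapes from the question shows the difference. This is only a sketch, and the variable names are illustrative:

```ruby
# The two URL shapes that appear in the question.
short_url = "http://rads.stackoverflow.com/amzn/click/B000O3GCFU"
full_url  = "http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1"

# /\w+$/ grabs the trailing run of word characters, whatever it happens to be.
short_match = short_url[/\w+$/]  # the ID is the final segment here
full_match  = full_url[/\w+$/]   # here the tail belongs to the ref= segment

p short_match  # => "B000O3GCFU"
p full_match   # => "products_1"
```

So for the full Amazon URL, a pattern anchored on /dp/ is the safer choice.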
Assuming the product code is always preceded by /dp/ and followed by /:
url[/(?<=\/dp\/)[^\/]+/]
Or, perhaps more readable:
url[%r{(?<=/dp/)[^/]+}]
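If the lookbehind feels opaque, a capture group may read more easily: `String#[]` accepts a regex plus a capture index. The sketch below assumes the same /dp/ convention; the second URL is a made-up example with no product segment, included only to show the no-match case:

```ruby
url = "http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1"

# Match "/dp/" followed by one or more non-slash characters, and return capture group 1.
asin = url[%r{/dp/([^/]+)}, 1]

# Hypothetical URL without a /dp/ segment: the lookup returns nil instead of raising.
missing = "http://www.amazon.com/no-product-here"[%r{/dp/([^/]+)}, 1]

p asin     # => "B000O3GCFU"
p missing  # => nil
```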
Or, without using a regex:
parts = url.split('/')
parts[parts.index('dp') + 1]
A parser-based approach (to please Nicolas Tyler, or anyone else who would rather avoid parsing with a regex in this situation):
require 'uri'
product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
product_path = URI.parse(product_uri).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce",
# "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]
# This relies on (un-researched assumption) location in path being consistent
# Now we have components though, we can look at Amazon's documentation and
# select based on position in path, relative position from some other identifier
# etc, without risk of a regex mismatch
product_asin = product_path[3]
# => "B000O3GCFU"
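As the comments above suggest, selecting relative to another identifier is less brittle than a fixed index. Here is a minimal sketch that takes the segment following "dp". The helper name `asin_from_uri` is hypothetical, and the /dp/ convention is still an unverified assumption about Amazon's URLs:

```ruby
require 'uri'

# Hypothetical helper: returns the path segment right after "dp",
# or nil if there is no such segment or the href cannot be parsed.
def asin_from_uri(href)
  segments = URI.parse(href).path.split('/')
  dp_index = segments.index('dp')
  dp_index && segments[dp_index + 1]
rescue URI::InvalidURIError
  nil
end

p asin_from_uri("http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1")
# => "B000O3GCFU"
```

Because only the path component is inspected, a query string on the link (if one is ever present) would not end up in the result.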