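I'm working on a project for school and I'm having trouble finding the right CSS selectors in a website's HTML to pull in the data I'm looking for. This is also my first time web scraping, and I'm new to Ruby as well, so I apologize if this is a silly question.
I have successfully parsed the first set of data (I'm sure there are better ways to do it, but my approach works; feedback on that is welcome too):
URL: platinumgod.co.uk
The HTML for the first section I'm scraping is as follows (with the first item listed as an example):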
<div class="repentanceitems-container">
<h2>
"Repentance Items "
<span class="rep-item-ttl">(169)</span>
</h2>
<li class="textbox" data-tid="42.5" data-cid="42" data-sid="263">
<a
<div onclick class="item reb-itm-new re-itm263"></div>
<span>
<p class="item-title">Clear Rune</p>
<p class="r-itemid">ItemID: 263</p>
<p class="pickup">"Rune mimic"</p>
<p class="quality">Quality: 2</p>
<p>"When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)"</p>
<p>Drops a random rune on the floor when picked up</p>
<p>The recharge time of this item depends on the Rune/Soul Stone held:</p>
<p>1 room: Soul of Lazarus</p>
<p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p>
<p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p>
<p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p>
<p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p>
<p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p>
<ul>
<p>Type: Active</p>
<p>Recharge time: Varies</p>
<p>Item Pool: Secret Room, Crane Game</p>
</ul>
<p class="tags">* Secret Room</p>
</span>
</a>
</li>
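That is just one example of an item in the Repentance Items category. Here is my code for parsing all the information for every item in that category: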
# Repentance Items
repentance_items = []
html.at(".repentanceitems-container").css("li.textbox").each do |item|
  item_name = item.css("a span p.item-title").text
  item_id = item.css("a span p.r-itemid").text.sub(/^ItemID: /, "")
  # Strip the literal double quotes around the pickup text
  pickup_text = item.css("a span p.pickup").text.gsub('"', "")
  quality = item.css("a span p.quality").text.sub(/^Quality: /, "")
  use = item.css(".quality ~ p:not(.tags)").map { |row| row.text }
  item.css("a span ul").each do |child|
    item_type = child.css("p")[0].text.sub(/^Type: /, "")
    # Passive items have no "Recharge time" row, so the item pool moves up one position
    if child.css("p")[1].text.match "Recharge time"
      recharge_time = child.css("p")[1].text.sub(/^Recharge time: /, "")
      item_pool = child.css("p")[2].text.sub(/^Item Pool: /, "").gsub(/,\s*$/m, "").split(", ")
    else
      recharge_time = "N/A"
      item_pool = child.css("p")[1].text.sub(/^Item Pool: /, "").gsub(/,\s*$/m, "").split(", ")
    end
    repentance_items << {name: item_name, item_id: item_id, pickup_text: pickup_text, quality: quality, use: use, item_type: item_type, recharge_time: recharge_time, item_pool: item_pool}
  end
end
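The problem I'm facing is that when I try to scrape the next category, Repentance Trinkets, I'm not sure what the CSS selector should be to get that information. Many of the same classes are used as in the Repentance Items HTML, so I get the same results as before. The HTML for the trinkets is as follows (again with the first one listed as an example):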
<div class="repentanceitems-container">
<h2>
"Repentance Trinkets "
<span class="a-item-ttl">(61)</span>
</h2>
<li class="textbox" data-tid="1000" data-cid="804" data-sid="10129">
<a
<div onclick class="item rep-item rep-trink rep-junxx129"></div>
<span>
<p class="item-title">Jawbreaker</p>
<p class="r-itemid">TrinketID: 129</p>
<p class="pickup">"Don't chew on it"</p>
<p>Tears have a chance to become a tooth, dealing x3.2 damage, similar to Tough Love</p>
<p>The chance to fire a tooth with this trinket is affected by your Luck stat</p>
<p>At +0 luck you have ~12% chance for this effect to activate</p>
<p>At +9 luck every tear you fire will be a tooth</p>
<p class="tags">*, </p>
</span>
</a>
</li>
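I don't know where to start in order to select only these trinkets. If I use the same selectors as in the first part of my code, it obviously just pulls in the Repentance Items again.
Hopefully I've explained this well enough, but feel free to ask me more questions and I'll do my best to explain better.
Thanks in advance for any help!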
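Maybe you could start by splitting your first selector line into two parts: one that captures the containers, and another that looks for the items inside each container. It could look something like this (untested):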
repentance_items = []
repentance_trinkets = []
# css (rather than at) returns every matching container; at would return only the first one
html.css(".repentanceitems-container").each do |container|
  # Check which category you are in, so you know which array to add the results to, something like:
  repentance_target = if container.css('h2').text =~ /items/i
    repentance_items
  else
    repentance_trinkets
  end
  container.css("li.textbox").each do |item|
    # your current logic
    # assignment into the correct results array
    repentance_target << ...
  end
end
In the end, both arrays should be filled with the right items.
This is a bit of a rough draft, but I hope it helps. Let me know if anything is unclear.