<form method="post" action="/M740/Biography/History/Drama/12+Years+a+Slave">
<input type="image" src="/public_site/webroot/cache/imdb/2024544_100.jpg" width="100" style="float:right;margin-left:2px;">
<strong><span style="color: rgb(255, 69, 0);">12 Years a Slave</span></strong>
<br>
In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.<br>
<br><strong>Century Cinemax - Junction</strong><br>
<a href="tel:0774136246">0774136246</a>
<a href="tel:0208022073">0208022073</a>
<br>
12:10, 19:10, 21:40<br>
<br><strong>Fox Cineplex Sarit</strong><br>
<a href="tel:0203753025">0203753025</a>
<a href="tel:0720366208">0720366208</a>
<br>
11:00, 14:00, 18:00, 20:40<br>
<br><strong>Planet Media - Kisumu </strong><br>
<a href="tel:0731999100">0731999100</a>
<a href="tel:0724999100 & 0202629388">0724999100 & 0202629388</a>
<br>
12:00, 14:30, 20:30<br>
<br>
<input type="hidden" name="cinema" value="0">
<input type="hidden" name="searchMovie" value="0">
<input type="hidden" name="movie" value="740">
<input type="hidden" name="date" value="0">
<input type="hidden" name="groupId" value="0">
<input type="submit" name="ok" value="Further Details">
</form>
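OK, the above is just part of the HTML I'm trying to parse with Nokogiri. The HTML isn't very semantic, and I'm having a hard time extracting what I want with Nokogiri.
So far I can get the movie title, one cinema, and two phone numbers, but my approach can't really get everything.
This is the script I'm currently using:
require 'rubygems'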
require 'nokogiri'
require 'open-uri'
url = "http://flix.co.ke/Frontpage/Listings"
# URI.open comes from open-uri; a bare Kernel#open no longer accepts URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open(url))
doc.css(".min-width div form").each do |entry|
  title = entry.at_css("span").text
  puts title
  cinema = entry.at_css("br+ strong").text
  puts cinema
  phone = entry.at_css("a").text
  puts phone
  puts entry.at_css("a").next_element.text
end
This way I can only get the movie's title, one cinema, and two contact numbers, so my sample output looks like this:
12 Years a Slave
Century Cinemax - Junction
0774136246
0208022073
47 Ronin 3D
Century Cinemax - Junction
0774136246
0208022073
Delivery Man
Century Cinemax - Junction
0774136246
0208022073
Frozen
Century Cinemax - Junction
0774136246
0208022073
(continued...)
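There's a description after the title, right after the break tag, that I can't get. Also, how do I loop through all the cinemas inside the form tag, along with their phone numbers and the individual, comma-separated show times?
I just don't know where to start. For this case I'd like to end up with a result like this:
12 Years a Slave
In the antebellum United States, Solomon Northup ...
- Century Cinemax - Junction 0774136246 0208022073 12:10, 19:10, 21:40
- Fox Cineplex Sarit 0203753025 0720366208 11:00, 14:00, 18:00, 20:40
etc
Any help would be appreciated. Thanks in advance.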
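This is horrible HTML :/ It isn't valid: it has 451 errors and 9 warnings. There are no semantics, so you have to rely on the structure, which may change and break the scraper.
Still, you can use the sibling methods to get at these: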
doc.css('.min-width div form').each do |node|
  description = node.at_css('br').next_sibling.text
  puts description.strip
  puts '-'*10
end
# >> In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
# >> ----------
# >> A band of samurai set out to avenge the death and dishonor of their master at the hands of a ruthless shogun.
# >> ----------
# >> An affable underachiever finds out he's fathered 533 children through anonymous donations to a fertility clinic 20 years ago. Now he must decide whether or not to come forward when 142 of them file a lawsuit to reveal his identity.
# >> ----------
# >> Fearless optimist Anna teams up with Kristoff in an epic journey, encountering Everest-like conditions, and a hilarious snowman named Olaf in a race to find Anna's sister Elsa, whose icy powers have trapped the kingdom in eternal winter.
# >> ----------
# >> A medical engineer and an astronaut work together to survive after an accident leaves them adrift in space.
# >> ----------
# >> A pair of aging boxing rivals are coaxed out of retirement to fight one final bout -- 30 years after their last match.
# >> ----------
# >>
# >> ----------
# >> Harrison, overworked and underpaid is looking for money for bride price. A 'business' opportunity presents itself when he gets the keys to the Company house. With the CEO away on holiday, he has access to a vacant fully furnished house. He ...
# >> ----------
# >>
# >> ----------
# >> A chronicle of Nelson Mandela's life journey from his childhood in a rural village through to his inauguration as the first democratically elected president of South Africa.
# >> ----------
# >> Author P. L. Travers reflects on her difficult childhood while meeting with filmmaker Walt Disney during production for the adaptation of her novel, Mary Poppins.
# >> ----------
# >> The Manzoni family, a notorious mafia clan, is relocated to Normandy, France under the witness protection program, where fitting in soon becomes challenging as their old habits die hard.
# >> ----------
# >> The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring.
# >> ----------
# >> The film begins as Katniss Everdeen has returned home safe after winning the 74th Annual Hunger Games along with fellow tribute Peeta Mellark. Winning means that they must turn around and leave their family and close friends, embarking on a ...
# >> ----------
# >> A day-dreamer escapes his anonymous life by disappearing into a world of fantasies filled with heroism, romance and action. When his job along with that of his co-worker are threatened, he takes action in the real world embarking on a global ...
# >> ----------
# >> Faced with an enemy that even Odin and Asgard cannot withstand, Thor must embark on his most perilous and personal journey yet, one that will reunite him with Jane Foster and force him to sacrifice everything to save us all.
# >> ----------
# >> A journey into the lives of a mother polar bear and her two seven-month-old cubs as they navigate the changing Arctic wilderness they call home.
# >> ----------
# >> See and feel what it was like when dinosaurs ruled the Earth, in a story where an underdog dino triumphs to become a hero for the ages.
# >> ----------
Use css instead of at_css (the same way you already loop through the form elements, for example).
The HTML really isn't that bad, and you're on the right track. This is how you'd iterate over it:
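doc.search('.min-width div form').each do |form|
  title = form.at('span').text
  description = form.at('br').next.text
  puts title, description.strip
  form.search('br + strong').each do |el|
    cinema = el.text
    phones = []
    # Follow the <a> siblings after the cinema name, collecting each number.
    while next_el = el.at('+ a', '+ br + a')
      el = next_el
      phones << el.text
    end
    # The show times are the text node after the <br> that follows the last link.
    times = el.at('+ br').next.text
    puts "- #{cinema} #{phones.join(' ')} #{times.strip}"
  end
end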