HTML scraping with Python: top box office list from the IMDb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2

I want to print the top box office chart from the site above: the rank, title, weekend, gross, and weeks for every movie, in that order.

Sample output:
Rank: 1
Title: Godzilla
Weekend: $93.2M
Gross: $93.2M
Weeks: 1
Rank: 2
Title: Neighbors

This is just one simple way to extract those entities with BeautifulSoup:

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
html = urlopen(url).read()
page = BeautifulSoup(html, 'html.parser')
# Chart rows carry either the 'odd' or the 'even' class.
rows = page.find_all("tr", {'class': ['odd', 'even']})
for tr in rows:
    # Print the text of the title, weekend/gross, and weeks cells of each row.
    for cell in tr.find_all("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print(cell.get_text())

P.S. Arrange the output however you like.
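As one possible arrangement, the cell text printed by the loop above could be grouped per row and labelled the way the question asks. This is only a sketch: `format_row`, `LABELS`, and `sample_rows` are hypothetical names, and it assumes each row yields its four cells in page order (title, weekend, gross, weeks). The sample values are taken from the question's example output rather than from a live request.

```python
# Field labels in the order requested in the question.
LABELS = ("Title", "Weekend", "Gross", "Weeks")


def format_row(rank, cells):
    """Return the labelled lines for one chart row.

    `cells` is the text of the four cells matched in a row, assumed to be
    in page order: title, weekend, gross, weeks.
    """
    lines = ["Rank: {}".format(rank)]
    for label, text in zip(LABELS, cells):
        # Collapse the newlines and padding that scraped cell text often carries.
        lines.append("{}: {}".format(label, " ".join(text.split())))
    return "\n".join(lines)


# Hypothetical cell text, shaped like the sample output in the question.
sample_rows = [
    ["\n  Godzilla\n", "$93.2M", "$93.2M", "1"],
]

for rank, cells in enumerate(sample_rows, start=1):
    print(format_row(rank, cells))
```

In the real loop, the list passed as `cells` would be built from `get_text()` on each matched `td` of a row instead of being hard-coded.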

There is no need to scrape anything. See the answer I gave here:

How to scrape data from the IMDb business page?

The Python script below gives you:
1) the IMDb top box office movie list
2) the cast list for each movie.

from lxml.html import parse  # cssselect() also needs the cssselect package


def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    # Assumes the third child of the chart table holds one row per movie.
    bo_total = len(bo_table[0][2])
    # Never ask for more movies than the chart lists.
    count = min(no_of_movies, bo_total)
    # Select each column once instead of once per movie.
    title_cells = bo_page.cssselect('td.titleColumn')
    rating_cells = bo_page.cssselect('td.ratingColumn')
    weeks_cells = bo_page.cssselect('td.weeksColumn')
    movies = {}
    for i in range(count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + title_cells[i][0].get('href')
        mo['title'] = title_cells[i][0].text_content().strip()
        mo['year'] = title_cells[i][1].text_content().strip(" ()")
        # Each row has two ratingColumn cells: weekend first, then gross.
        mo['weekend'] = rating_cells[i * 2].text_content().strip()
        mo['gross'] = rating_cells[(i * 2) + 1][0].text_content().strip()
        mo['weeks'] = weeks_cells[i].text_content().strip()
        # Fetch the movie's own page to read its cast table.
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        mo['cast'] = []
        # Skip the first row of the cast table, as the original loop did.
        for cast in m_casttable[0][1:]:
            m_starname = cast[1][0][0].text_content().strip()
            mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.items():
        print('#' + str(k + 1) + '  ' + v['title'] + ' (' + v['year'] + ')')
        print('URL: ' + v['url'])
        print('Weekend: ' + v['weekend'])
        print('Gross: ' + v['gross'])
        print('Weeks: ' + v['weeks'])
        print('Cast: ' + ', '.join(v['cast']))
        print()  # blank line between movies

Output (run in a terminal):

parag@parag-innovate:~/python$ python imdb_bo_scraper.py 
Enter no. of Box office movies to display:3
#1  Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden

#2  Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski

#3  Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
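The weekend and gross values come back as strings such as `$67.88M`, so they cannot be compared or sorted numerically as they are. A minimal sketch of one way to convert them follows; `parse_gross` is a hypothetical helper, not part of the script above. Only the `M` suffix appears in the output shown, so the handling of `K` and `B` is an assumption.

```python
def parse_gross(text):
    """Convert a chart string such as '$67.88M' into a number of dollars."""
    # Only 'M' is seen in the output above; 'K' and 'B' are assumed suffixes.
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    value = text.strip().lstrip("$").replace(",", "")
    suffix = value[-1:].upper()
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)


# Gross figures copied from the terminal output above.
sample = {
    "Cinderella": "$67.88M",
    "Run All Night": "$11.01M",
    "Kingsman: The Secret Service": "$107.39M",
}

# Order the titles by total gross, highest first.
ranked = sorted(sample, key=lambda title: parse_gross(sample[title]), reverse=True)
print(ranked)
```

The same helper could be applied to `v['weekend']` and `v['gross']` inside the printing loop if numeric values are preferred over the raw strings.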
