How do I extract movie titles and ratings from the IMDB database?



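I am very new to web scraping in Python. I want to extract the movie name, release year, and rating from the IMDB database. This is the IMDB page, with 250 movies and ratings: https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm. I am using the BeautifulSoup and requests modules. Here is my code: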

movies = bs.find('tbody',class_='lister-list').find_all('tr')

When I try to extract the movie name, rating, and year, I get the same AttributeError each time.

<td class="titleColumn">
<a href="/title/tt11564570/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=ea4e08e1-c8a3-47b5-ac3a-75026647c16e&amp;pf_rd_r=BQWZRBFAM81S7K6ZBPJP&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=moviemeter&amp;ref_=chtmvm_tt_1" title="Rian Johnson (dir.), Daniel Craig, Edward Norton">Glass Onion: une histoire à couteaux tirés</a>
<span class="secondaryInfo">(2022)</span>
<div class="velocity">1
<span class="secondaryInfo">(
<span class="global-sprite telemeter up"></span>
1)</span>
<td class="ratingColumn imdbRating">
<strong title="7,3 based on 207 962 user ratings">7,3</strong>

title = movies.find('td',class_='titleColumn').a.text
rating = movies.find('td',class_='ratingColumn imdbRating').strong.text
year = movies.find('td',class_='titleColumn').span.text.strip('()')

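AttributeError                            Traceback (most recent call last)
<ipython-input-9-2363bafd916b> in <module>
----> 1 title = movies.find('td',class_='titleColumn').a.text
      2 title

~\anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2287     def __getattr__(self, key):
   2288         """Raise a helpful exception to explain a common code fix."""
-> 2289         raise AttributeError(
   2290             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2291         )

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?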

Can anyone help me solve this? Thanks in advance!

To work with the ResultSet as a list, you can try the example below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm")
#print(res)
soup = BeautifulSoup(res.content, "html.parser")
# Each <tr> in the chart table holds one movie.
for card in soup.select('.chart.full-width tbody tr'):
    data.append({
        "title": card.select_one('.titleColumn a').get_text(strip=True),
        "year": card.select_one('.titleColumn span').text,
        'rating': card.select_one('td[class="ratingColumn imdbRating"]').get_text(strip=True)
    })
df = pd.DataFrame(data)
print(df)
#df.to_csv('out.csv', index=False)

Output:

                                                title    year rating
0                            Avatar: The Way of Water  (2022)    7.9
1                                         Glass Onion  (2022)    7.2
2                                            The Menu  (2022)    7.3
3                                         White Noise  (2022)    5.8
4                                   The Pale Blue Eye  (2022)    6.7
..                                                ...     ...    ...
95                                          Zoolander  (2001)    6.5
96                      Once Upon a Time in Hollywood  (2019)    7.6
97  The Lord of the Rings: The Fellowship of the Ring  (2001)    8.8
98                                     New Year's Eve  (2011)    5.6
99                            Spider-Man: No Way Home  (2021)    8.2
[100 rows x 3 columns]
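
The year values in this output keep their parentheses, and every field is still a string. Below is a minimal sketch of one way to normalise each row before building the DataFrame. `clean_row` is a hypothetical helper, not part of the answer, and the decimal-comma handling assumes ratings may arrive as "7,3", as in the HTML quoted in the question.

```python
def clean_row(row):
    """Return a copy of a scraped row with a numeric year and rating.

    Hypothetical helper for illustration; it is not part of the answer above.
    """
    # "(2022)" -> "2022"
    year_text = row["year"].strip().strip("()")
    # Assumption: some locales render the rating with a decimal comma ("7,3").
    rating_text = row["rating"].strip().replace(",", ".")
    return {
        "title": row["title"],
        "year": int(year_text) if year_text.isdigit() else None,
        "rating": float(rating_text) if rating_text else None,
    }


# Sample rows shaped like the scraped dictionaries.
rows = [
    {"title": "Glass Onion", "year": "(2022)", "rating": "7.2"},
    {"title": "Zoolander", "year": "(2001)", "rating": "6.5"},
]
cleaned = [clean_row(row) for row in rows]
print(cleaned)
```

Passing `cleaned` to `pd.DataFrame` instead of the raw rows would give numeric columns that can be sorted or filtered directly.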

Update: extracting the data with the find_all and find methods instead.

from bs4 import BeautifulSoup
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0'}
data = []
# Pass the headers defined above so the request carries a browser-like User-Agent.
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
#print(res)
soup = BeautifulSoup(res.content, "html.parser")
# find_all returns every row; find is then called on each single row.
for card in soup.table.tbody.find_all("tr"):
    data.append({
        "title": card.find("td",class_="titleColumn").a.get_text(strip=True),
        "year": card.find("td",class_="titleColumn").span.get_text(strip=True),
        'rating': card.find('td',class_="ratingColumn imdbRating").get_text(strip=True)
    })
df = pd.DataFrame(data)
print(df)
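
Both `find` and attribute access such as `.strong` return `None` when nothing matches, so chaining `.text` onto a missing element raises an AttributeError of its own. The sketch below guards against that case. The inline HTML and the `text_or_none` helper are assumptions made for illustration, not IMDB's actual markup, and the sketch assumes every row has a title cell.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the second row has no rating.
html = """
<table><tbody>
<tr><td class="titleColumn"><a href="#">Glass Onion</a> <span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.2</strong></td></tr>
<tr><td class="titleColumn"><a href="#">Unrated Film</a> <span>(2023)</span></td>
    <td class="ratingColumn imdbRating"></td></tr>
</tbody></table>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
data = []
for card in soup.table.tbody.find_all("tr"):
    title_cell = card.find("td", class_="titleColumn")
    rating_cell = card.find("td", class_="ratingColumn imdbRating")
    data.append({
        "title": text_or_none(title_cell.a),
        "year": text_or_none(title_cell.span),
        # rating_cell.strong is None for the unrated row.
        "rating": text_or_none(rating_cell.strong),
    })
print(data)
```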

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

find_all returns a ResultSet (a list of elements), so movies is a list rather than a single element. You need to iterate over it with for movie in movies:

for movie in movies:
    title = movie.find('td',class_='titleColumn').a.text
    rating = movie.find('td',class_='ratingColumn imdbRating').strong.text
    year = movie.find('td',class_='titleColumn').span.text.strip('()')
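
To see the difference without hitting the network, here is a small sketch that runs the same calls on an inline HTML fragment. The markup is a simplified assumption modelled on the snippet in the question, not the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the chart table (assumed structure).
html = """
<table><tbody class="lister-list">
<tr><td class="titleColumn"><a href="#">Glass Onion</a> <span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.2</strong></td></tr>
<tr><td class="titleColumn"><a href="#">The Menu</a> <span>(2022)</span></td>
    <td class="ratingColumn imdbRating"><strong>7.3</strong></td></tr>
</tbody></table>
"""
bs = BeautifulSoup(html, "html.parser")
movies = bs.find("tbody", class_="lister-list").find_all("tr")

# find_all returns a ResultSet, which behaves like a list of Tag objects.
print(isinstance(movies, list), len(movies))

# Calling find on the whole ResultSet reproduces the error from the question.
error_message = None
try:
    movies.find("td", class_="titleColumn")
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Calling find on each individual row works.
results = []
for movie in movies:
    title_cell = movie.find("td", class_="titleColumn")
    results.append({
        "title": title_cell.a.text,
        "year": title_cell.span.text.strip("()"),
        "rating": movie.find("td", class_="ratingColumn imdbRating").strong.text,
    })
print(results)
```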
