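I'm trying to assign a path to a variable so that I can scrape the information found at that path. However, I keep getting an empty list.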



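I'm using Python, and the basic approach I'm following is:

create an empty list -> use a for loop to iterate over the elements on the web page -> append that information to the empty list -> use pandas to turn the lists into rows and columns -> finally write a CSV.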

The code I wrote is:
import requests 
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.imdb.com/find?q=top+1000+movies&ref_=nv_sr_sm"
results=requests.get(url,headers=headers)
soup=BeautifulSoup(results.text,"html.parser")
# print(soup.prettify())
#initializing empty lists where the data will go
titles =[]
years = []
times = []
imdb_rating = []
metascores = []
votes = []
us_gross = []
movie_div = soup.find_all('div',class_='lister-list')
#initiating the loop for scraper 
for container in movie_div:
    # titles
    name = container.tr.td.a.text
    titles.append(name)
print(titles)

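The site I want to scrape is 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'. I need help working out the correct path to assign to the variable 'name' so that I can extract the movie title given in name_of_movei in the page's HTML, because every time I run this the output is an empty list.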
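One likely cause of the empty list is that `find_all('div', class_='lister-list')` filters on a `div` tag, while the rows appear to sit directly under an element that is not a `div`. The sketch below reproduces this offline. `SAMPLE_HTML` is a hypothetical, simplified stand-in for the chart table, not markup copied from the live page, so treat the tag and class names as assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the chart table (an assumption,
# not the real page source).
SAMPLE_HTML = """
<table>
  <tbody class="lister-list">
    <tr><td class="titleColumn"><a href="/title/1">First Movie</a></td></tr>
    <tr><td class="titleColumn"><a href="/title/2">Second Movie</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Filtering on the wrong tag name matches nothing, so a loop over the
# result never runs and the titles list stays empty.
wrong = soup.find_all("div", class_="lister-list")

# Selecting the rows themselves yields one element per movie.
rows = soup.select(".lister-list > tr")
titles = [row.select_one(".titleColumn a").text.strip() for row in rows]

print(wrong)
print(titles)
```

Note that even with the right container, `container.tr.td.a.text` would only read the first row of that container, so looping over the rows rather than over the container is what gives one title per movie.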

The following example parses the name, year, and rating from the table and builds a DataFrame from them:

import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
all_data = []
for row in soup.select(".lister-list > tr"):
    name = row.select_one(".titleColumn a").text.strip()
    year = row.select_one(".titleColumn .secondaryInfo").text.strip()
    rating = row.select_one(".imdbRating").text.strip()
    # ...other variables
    all_data.append([name, year, rating])

df = pd.DataFrame(all_data, columns=["Name", "Year", "Rating"])
print(df.head().to_markdown(index=False))

Prints:

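| Name                     | Year   | Rating |
|:-------------------------|:-------|-------:|
| The Shawshank Redemption | (1994) |    9.2 |
| The Godfather            | (1972) |    9.2 |
| …                        | (2008) |      … |
| 12 Angry Men             | (1957) |    8.9 |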
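The question also mentions writing the result to a CSV as the last step. A minimal sketch of that step follows, assuming rows shaped like `all_data` above. The two rows here are placeholders rather than scraped values, and the in-memory buffer is only used so the example runs without touching the disk.

```python
import io

import pandas as pd

# Placeholder rows in the same [name, year, rating] shape as all_data.
all_data = [
    ["Movie A", "(2001)", "8.1"],
    ["Movie B", "(2002)", "7.9"],
]

df = pd.DataFrame(all_data, columns=["Name", "Year", "Rating"])

# The scraped values are strings: drop the parentheses around the year
# and convert both columns to numeric types.
df["Year"] = df["Year"].str.strip("()").astype(int)
df["Rating"] = df["Rating"].astype(float)

# Write CSV text to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To write an actual file, pass a path instead of the buffer, for example `df.to_csv("movies.csv", index=False)`, where `movies.csv` is a hypothetical filename.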
