How do I scrape production company names from the IMDb website?



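I need to scrape the production company names for some movies. I have been trying to use the anchor tag a together with the class that contains the names, but it does not return the production companies.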

URL: https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1

This is the part of the page's HTML that I want to scrape:

<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>

This is what I tried:

import requests
from bs4 import BeautifulSoup
movie_url="https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1"
movie_page = requests.get(movie_url)
soup = BeautifulSoup(movie_page.text, 'html.parser')
#movies_comp = soup.find_all("li", class_="ipc-inline-list__item")
movies_comp = soup.find_all("a", class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")
print(movies_comp)

I am not getting the desired output. The output I want it to return is:

['IDT Entertainment', 'New Arc Entertainment']

You can try:

import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1")  # live response (overwritten below by the sample HTML for this demo)
page="""
<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>
"""
soup=BeautifulSoup(page,"lxml")
# To understand this, here is the structure of the data you want to extract:
# <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies">
# <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation"><li role="presentation" class="ipc-inline-list__item"><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
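# Locate the "Production companies" list item, then the inline list nested inside it.
companies_li = soup.find("li", attrs={'class': 'ipc-metadata-list__item ipc-metadata-list-item--link', 'data-testid': 'title-details-companies'})
companies_ul = companies_li.find("ul", class_="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base")
# Collect the text of every link in that list.
print([a.text for a in companies_ul.find_all("a")])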

Output:

['IDT Entertainment', 'New Arc Entertainment']

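Many <a> tags on the page share that class, so searching by class alone gives you multiple of them, not just the production companies.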
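
To illustrate why narrowing the search to the data-testid="title-details-companies" item helps, here is a minimal sketch that compares a class-only search with a scoped CSS selector. The trimmed sample HTML, the second list item ("Example section"), and the helper names production_companies and links_by_class_only are assumptions made up for this illustration; they are not taken from the real IMDb page, whose markup may also change over time.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup. The second <li> ("Example section") is hypothetical:
# it only exists here to show another section reusing the same <a> class.
SAMPLE_HTML = """
<ul>
<li data-testid="title-details-companies">
<a class="ipc-metadata-list-item__label" href="/title/tt0473553/companycredits">Production companies</a>
<ul>
<li><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" href="/company/co0136980">IDT Entertainment</a></li>
<li><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" href="/company/co0142161">New Arc Entertainment</a></li>
</ul>
</li>
<li data-testid="title-details-example">
<a class="ipc-metadata-list-item__label" href="#">Example section</a>
<ul>
<li><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" href="#">Unrelated link</a></li>
</ul>
</li>
</ul>
"""

LINK_CLASS = "ipc-metadata-list-item__list-content-item--link"


def links_by_class_only(html):
    """Return the text of every <a> carrying LINK_CLASS, wherever it appears."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_=LINK_CLASS)]


def production_companies(html):
    """Return only the links nested in the companies item's inner <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select('li[data-testid="title-details-companies"] ul a')
    return [a.get_text(strip=True) for a in links]


print(links_by_class_only(SAMPLE_HTML))   # includes the unrelated link
print(production_companies(SAMPLE_HTML))  # only the two company names
```

The selector ends in "ul a" so that the "Production companies" label link, which sits directly inside the <li> rather than in the nested list, is skipped. For a live page you would pass the response text from requests.get to production_companies instead of the sample string.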
