How to distinguish two tables with the same relative XPath using Selenium in Python



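I am trying to scrape some data from IMDb (using Selenium in Python), but I have run into a problem. For each movie I need the director and the writers. Both are contained in two tables that share the same @class. When I scrape, I need to tell the two tables apart; otherwise the program may sometimes record a writer as a director, or vice versa.

I tried using a relative XPath to find all elements (tables) matching that XPath and then looping over them. Inside the loop I try to distinguish them by the table title (an h4 element) using the preceding-sibling axis. The code runs, but it never finds anything (it returns nan every time).

Here is my code: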

counter = 1
try:
    driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
    ssleep()
    tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
    counter = 1
    for table in tables:
        xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]'
        xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
        table_title = driver.find_element(By.XPATH, xpath_h4).text
        if table_title == "Directed by":
            rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
            for row in rows_director:
                director = row.find_elements(By.CSS_SELECTOR, 'a')
                director = [x.text for x in director]
                if len(director) == 1:
                    director = ''.join(map(str, director))
                else:
                    director = ', '.join(map(str, director))
                director_list.append(director)
        counter += 1
except NoSuchElementException:
    # director = np.nan
    director_list.append(np.nan)

Can anyone tell me why it isn't working? Maybe there is also a better solution. I would appreciate your help.

(Here is an example of a page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)

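To extract the names of the directors and writers of each movie on imdb.com, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies: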

  • Using CSS_SELECTOR:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director +table > tbody tr > td > a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer +table > tbody tr > td > a")))])
    
  • Using XPATH:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']//following::table[1]/tbody//tr/td/a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']//following::table[1]/tbody//tr/td/a")))])
    
  • Console output:

    ['Matt Reeves']
    ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
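
As for why the loop in the question comes back with nan every time, two things in its XPath seem to be responsible, assuming the page layout implied by the locators above (each h4 heading immediately followed by its own table). First, tbody[{counter}] indexes the tbody children within each table rather than across the matched tables, so tbody[2] matches nothing. Second, the h4 is a sibling of the table, not of the tbody, so preceding-sibling::h4 evaluated from a tbody is empty. In addition, Selenium's find_element needs an XPath that resolves to an element, so a trailing /text() step is not usable there. The sketch below reproduces the first two points offline with the standard library's xml.etree.ElementTree on a hand-written fragment; the fragment and the heading text "Writing Credits" are assumptions for illustration, not IMDb's actual markup.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the credits page (an assumption, not real IMDb HTML):
# each h4 heading is followed by its own table, and each table has one tbody.
SAMPLE = """
<div>
  <h4 id="director">Directed by</h4>
  <table class="simpleTable simpleCreditsTable">
    <tbody><tr><td class="name"><a>Matt Reeves</a></td></tr></tbody>
  </table>
  <h4 id="writer">Writing Credits</h4>
  <table class="simpleTable simpleCreditsTable">
    <tbody><tr><td class="name"><a>Peter Craig</a></td></tr></tbody>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)
TABLE = ".//table[@class='simpleTable simpleCreditsTable']"

# A positional predicate counts among the children of each parent table,
# so [1] matches the single tbody of BOTH tables and [2] matches nothing.
first = root.findall(TABLE + "/tbody[1]")
second = root.findall(TABLE + "/tbody[2]")
print(len(first), len(second))  # 2 0

# The headings sit next to the tables, one level above the tbody elements.
sibling_tags = [child.tag for child in root]
print(sibling_tags)  # ['h4', 'table', 'h4', 'table']
```

This is why the locators above start from the h4 id and step forward to the table, instead of starting from the table class and stepping backward.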
    

You can use the id attributes of the h4 tags for Directors and Writers to extract the data.

Try the following:

# Imports Required
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

links = [
    "https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm",
    "https://www.imdb.com/title/tt10234724/fullcredits/?ref_=tt_cl_sm",
    "https://www.imdb.com/title/tt10872600/fullcredits?ref_=tt_cl_wr_sm",
    "https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_cl_wr_sm",
]
for link in links:
    driver.get(link)
    wait = WebDriverWait(driver, 20)

    # Get the name of the movie
    name = wait.until(EC.presence_of_element_located((By.XPATH, "//h3[@itemprop='name']/a"))).text

    # Get the Directors
    directors = driver.find_elements(By.XPATH, "//h4[@id='director']/following-sibling::table[1]//tr")
    dir_list = []
    for director in directors:
        # Add the director names in the list. You can format the unwanted string using replace.
        dir_list.append(director.text)
    # Get the Writers
    writers = driver.find_elements(By.XPATH, "//h4[@id='writer']/following-sibling::table[1]//tr")
    wri_list = []
    for writer in writers:
        # Add the Writer names in the list. You can format the unwanted string using replace.
        wri_list.append(writer.text)
    # Print the data.
    print(f"Name of the movie: {name}")
    print(f"Directors : {dir_list}")
    print(f"Writers : {wri_list}")

Output:

Name of the movie: The Batman
Directors : ['Matt Reeves ... (directed by)']
Writers : ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ', 'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
Name of the movie: Moon Knight
Directors : ['Justin Benson ... (5 episodes, 2022)', 'Mohamed Diab ... (5 episodes, 2022)', 'Aaron Moorhead ... (5 episodes, 2022)']
Writers : ['Danielle Iman ... (staff writer) (6 episodes, 2022)', 'Doug Moench ... (characters) (6 episodes, 2022)', 'Doug Moench ... (creator) (6 episodes, 2022)', 'Don Perlin ... (characters) (6 episodes, 2022)', 'Jeremy Slater ... (created for television by) (6 episodes, 2022)', 'Jeremy Slater ... (6 episodes, 2022)', 'Peter Cameron ... (written by) (2 episodes, 2022)', 'Sabir Pirzada ... (written by) (2 episodes, 2022)', 'Beau DeMayo ... (written by) (1 episode, 2022)', 'Michael Kastelein ... (written by) (1 episode, 2022)', 'Alex Meenehan ... (written by) (1 episode, 2022)', 'Jack Kirby ... (Based on the Marvel comics by) (unknown episodes)', 'Stan Lee ... (Based on the Marvel comics by) (unknown episodes)']
Name of the movie: Spider-Man: No Way Home
Directors : ['Jon Watts']
Writers : ['Chris McKenna ... (written by) &', 'Erik Sommers ... (written by)', ' ', 'Stan Lee ... (based on the Marvel comic book by) and', 'Steve Ditko ... (based on the Marvel comic book by)']
Name of the movie: Dune
Directors : ['Denis Villeneuve ... (directed by)']
Writers : ['Jon Spaihts ... (screenplay by) and', 'Denis Villeneuve ... (screenplay by) and', 'Eric Roth ... (screenplay by)', ' ', 'Frank Herbert ... (based on the novel Dune written by)']
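
The row text printed above still carries the role suffix (for example "... (written by) &") and blank separator rows. One possible way to reduce it to bare names is sketched below. The helper name clean_credit_rows is hypothetical, and it assumes that " ... " always separates the name from the role, as it does in the output shown; it also drops repeated names, which may or may not be what you want for TV credits.

```python
def clean_credit_rows(rows):
    """Reduce row texts like 'Matt Reeves ... (written by) &' to bare names.

    Assumes ' ... ' separates the name from the role (as in the output above).
    Blank separator rows are skipped and repeated names are kept only once.
    """
    names = []
    for row in rows:
        # Keep only the part before the first ' ... ' and trim whitespace.
        name = row.split(" ... ", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


batman_writers = [
    "Matt Reeves ... (written by) &",
    "Peter Craig ... (written by)",
    " ",
    "Bill Finger ... (Batman created by) &",
    "Bob Kane ... (Batman created by)",
]
print(clean_credit_rows(batman_writers))
# ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
```

Rows without a role, such as 'Jon Watts', pass through unchanged because there is nothing to split off.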

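Since this is static page content, you don't even need Selenium. You can use the lightweight Python requests module together with bs4. This is just another approach.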

import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm")
result=res.text
soup=BeautifulSoup(result, 'html.parser')
directors=[director.text.strip() for director in soup.select("h4#director+table tr td.name>a")]
writers=[writer.text.strip() for writer in soup.select("h4#writer+table tr td.name>a")]
print(directors)
print(writers)

Output:

['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
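
To feed a result like this back into the director_list from the question, the list of names has to be collapsed into one cell, with a NaN placeholder when a title has no such section. A minimal sketch follows; the helper name names_to_cell is hypothetical, and math.nan stands in for the question's np.nan so that the snippet needs only the standard library.

```python
import math


def names_to_cell(names):
    """Join names into one comma-separated string, or return NaN if none."""
    # Drop empty or whitespace-only entries before joining.
    cleaned = [n.strip() for n in names if n and n.strip()]
    # ', '.join also handles the single-name case, so no special branch is needed.
    return ", ".join(cleaned) if cleaned else math.nan


director_list = []
director_list.append(names_to_cell(["Matt Reeves"]))
director_list.append(names_to_cell([]))  # title without a director table
print(director_list)  # ['Matt Reeves', nan]
```

A single ', '.join already covers the one-name case, so the len(director) == 1 branch in the question is not needed.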
