How do I get a parent attribute (e.g. a link) when the distinguishing information is in a later child element?



Using Selenium (Python) to avoid spoiling a football match

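I am trying to get the URL of a football match replay video from a dynamically changing web page. The page shows the score, and I would rather fetch the link directly than visit a site that will almost certainly show me the score. There are other videos related to the match, such as a 10-minute highlights clip, but I only want the full replay.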

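The page has a list of videos to choose from. However, the "h1" heading that marks a video as the full replay is wrapped inside the "a" tag (see below). There are about 10 such list items on the page, and they differ only in the content of the "h1", which is buried as a descendant. The text I care about is "Brentford v LFC : Full match"; the "Full match" part is the giveaway.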

My question is: how do I get the link when the identifying information sits in a later child element?

<li data-sidebar-video="0_5de4sioh" class="js-subscribe-entitlement">
<a class="" href="//video.liverpoolfc.com/player/0_5de4sioh/">
<article class="video-thumb video-thumb--fade-in js-thumb video-thumb--no-duration video-thumb--sidebar">
<figure class="video-thumb__img">
<div class="site-loader">
<ul>
<li></li>
<li></li>
<li></li>
</ul>
</div> <img class="video-thumb__img-container loaded" data-src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" alt="Brentford v LFC : Full match" onerror="PULSE.app.common.VideoThumbError(this)" onload="PULSE.app.common.VideoThumbLoaded(this)"
src="//open.http.mp.streamamg.com/p/101/thumbnail/entry_id/0_5de4sioh/width/150/height/90/type/3" data-image-initialised="true"> <span class="video-thumb__premium">Premium</span> <i class="video-thumb__play-btn"></i> <span class="video-thumb__time"> <i class="video-thumb__icon"></i> 1:45:07 </span>        </figure>
<div class="video-thumb__txt-container"> <span class="video-thumb__tag js-video-tag">Match Action</span>
<h1 class="video-thumb__heading">Brentford v LFC : Full match</h1> <time class="video-thumb__date">25th Sep 2021</time> </div>
</article>
</a>
</li>

My code currently looks like this. It gives me a list of links, but I can't tell which is which.

from selenium import webdriver
#------------------------Account login---------------------------#
#I have to login to my account first. 
#----------------------------------------------------------------#
username = "<my username goes here>"
password = "<my password goes here>"
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#I have to go to the matches section of my account and click on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay.
#--------------------------------------------------#
#prints all the videos in the list. They all have the same "data-sidebar-video" attribute 
web_element1 = driver.find_elements_by_css_selector('li[data-sidebar-video*=""] > a')
print(web_element1)
for i in web_element1:
    print(i.get_attribute('href'))

You can do this with a simple XPath locator, since you are searching based on contained text.

//a[.//h1[contains(text(),'Full match')]]
  ^ an A tag
       ^ that has an H1 descendant
          ^ that contains the text "Full match"
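
As a rough sketch of how that locator might be used from Python, the helper below wraps it in a function. The names `FULL_MATCH_XPATH` and `full_match_href` are made up for illustration, and `driver` is assumed to be a Selenium 4 WebDriver that already has the match page loaded. The strategy is passed as the plain string "xpath", which is the value of Selenium's `By.XPATH` constant.

```python
# XPath from the answer: an <a> that has an <h1> descendant containing the text.
FULL_MATCH_XPATH = "//a[.//h1[contains(text(),'Full match')]]"


def full_match_href(driver):
    """Return the href of the first link whose <h1> mentions 'Full match'.

    `driver` is assumed to be a Selenium 4 WebDriver with the match page
    already loaded. "xpath" is the string value of selenium's By.XPATH.
    """
    link = driver.find_element("xpath", FULL_MATCH_XPATH)
    return link.get_attribute("href")
```

If no link matches, `find_element` raises `NoSuchElementException`, so callers may want to catch that when a full replay has not been published yet.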

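Note: the href on the A tag is protocol-relative rather than a complete URL, e.g. //video.liverpoolfc.com/player/0_5de4sioh/. I suggest you simply click the link. If you want to write it to a file and the value you read lacks a scheme, you have to prepend "https:" to these partial URLs to make them usable.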
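
One way to make such a partial URL usable without string concatenation is the standard library's `urljoin`, which borrows the scheme from the page the link was found on. This is only a sketch; the helper name `absolute_url` is made up, and the matches page URL is taken from the question.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against page_url; scheme-relative hrefs inherit the page's scheme."""
    return urljoin(page_url, href)


print(absolute_url(
    "https://video.liverpoolfc.com/matches",
    "//video.liverpoolfc.com/player/0_5de4sioh/",
))
# prints https://video.liverpoolfc.com/player/0_5de4sioh/
```

An href that is already absolute passes through unchanged, so the helper can be applied to every link regardless of how it was read.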

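You can try the following.

Extract the list of videos (the li tags), check whether the h1 tag in each list item contains "Full match", and if it does, get the href of that item's a tag.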

# Imports Required:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver.get("https://video.liverpoolfc.com/player/0_5j5fsdzg/?contentReferences=FOOTBALL_FIXTURE%3Ag2210322&page=0&pageSize=20&sortOrder=desc&title=Highlights%3A%20Brentford%203-3%20LFC&listType=LIST-DEFAULT")
wait = WebDriverWait(driver,30)
wait.until(EC.visibility_of_element_located((By.XPATH,"//ul[contains(@class,'related-videos')]/li")))
videos = driver.find_elements_by_xpath("//ul[contains(@class,'related-videos')]/li")
for video in videos:
    option = video.find_element_by_tag_name("h1").get_attribute("innerText")
    if "Full match" in option:
        link = video.find_element_by_tag_name("a").get_attribute("href")
        print(f"{option} : {link}")

Output:

Brentford v LFC : Full match : https://video.liverpoolfc.com/player/0_5de4sioh/
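
The same "check the h1, then take the enclosing link" idea can also be tried offline on saved markup (for example the string from `driver.page_source`) using only the standard library. This swaps Selenium for `html.parser`, so it is a different technique from the answer above. The class name `FullMatchLinkParser` is made up, and the second list item in the sample is a hypothetical entry added only to show the filtering.

```python
from html.parser import HTMLParser


class FullMatchLinkParser(HTMLParser):
    """Collect (title, href) for <a> elements containing an <h1> with the wanted text."""

    def __init__(self, needle="Full match"):
        super().__init__()
        self.needle = needle
        self.current_href = None  # href of the <a> we are currently inside
        self.in_h1 = False
        self.h1_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
        elif tag == "h1":
            self.in_h1 = True
            self.h1_text = []

    def handle_data(self, data):
        if self.in_h1:
            self.h1_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
            title = "".join(self.h1_text).strip()
            # Keep the enclosing link only when the heading matches.
            if self.needle in title and self.current_href:
                self.links.append((title, self.current_href))
        elif tag == "a":
            self.current_href = None


# Trimmed-down markup; the second <li> is a hypothetical non-matching item.
sample = """
<li data-sidebar-video="0_5de4sioh">
  <a class="" href="//video.liverpoolfc.com/player/0_5de4sioh/">
    <article><h1 class="video-thumb__heading">Brentford v LFC : Full match</h1></article>
  </a>
</li>
<li data-sidebar-video="hypothetical_id">
  <a class="" href="//video.liverpoolfc.com/player/hypothetical_id/">
    <article><h1 class="video-thumb__heading">Brentford v LFC : Highlights</h1></article>
  </a>
</li>
"""

parser = FullMatchLinkParser()
parser.feed(sample)
print(parser.links)
```

Unlike Selenium's `get_attribute("href")`, this reads the raw attribute from the markup, so the hrefs it returns are still protocol-relative.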

You can use driver.execute_script to grab only the links that have "Full match" in a child element:

links = driver.execute_script('''
    var links = [];
    for (var i of document.querySelectorAll('li[data-sidebar-video*=""] > a')){
        if (i.querySelector('h1.video-thumb__heading').textContent.endsWith('Full match')){
            links.push(i.getAttribute('href'));
        }
    }
    return links;
''')
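
A slightly more defensive variant is sketched below, under a few assumptions: it selects on the presence of the `data-sidebar-video` attribute rather than on its value, skips list items that have no matching h1, and passes the search text in as a script argument. The function name `full_match_links` is made up, and `driver` is assumed to be a Selenium WebDriver with the match page loaded.

```python
# JavaScript run in the page; arguments[0] is the text passed from Python.
FIND_LINKS_JS = """
const needle = arguments[0];
return Array.from(document.querySelectorAll('li[data-sidebar-video] > a'))
    .filter(function (a) {
        const heading = a.querySelector('h1.video-thumb__heading');
        return heading !== null && heading.textContent.includes(needle);
    })
    .map(function (a) { return a.href; });
"""


def full_match_links(driver, needle="Full match"):
    """Return the hrefs of sidebar links whose heading contains `needle`."""
    return driver.execute_script(FIND_LINKS_JS, needle)
```

Reading the `href` property instead of calling `getAttribute('href')` lets the browser resolve the protocol-relative value into a full URL.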

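This is what worked. I used the answers from @JeffC and @pmadhu to get stable, working code. I also added a headless option so that you can run the code without having to look at the web page, which could inadvertently show you the score you are trying to avoid! As a result, I had to remove the two lines of wait code; I have just commented them out in case you want to keep them.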

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
#------------------------Account login---------------------------#
#Logs into my account
#----------------------------------------------------------------#
username = "" #<----my username goes here
password = "" #<----my password goes here
username_object_id = "login_form_username"
password_object_id = "login_form_password"
login_button_name = "submitBtn"
login_url = "https://video.liverpoolfc.com/mylfctvgo"
#headless option is added so that this can operate in the background.
headless_option = webdriver.ChromeOptions()
headless_option.add_argument("headless")
driver = webdriver.Chrome("/usr/local/bin/chromedriver", options=headless_option)
driver.get(login_url)
driver.implicitly_wait(10)
driver.find_element_by_id(username_object_id).send_keys(username)
driver.find_element_by_id(password_object_id).send_keys(password)
driver.find_element_by_name(login_button_name).click()
#--------------Find most recent game played----------------#
#Clicks on the match section of my account and clicks on the most recent game
#----------------------------------------------------------------#
matches_url = "https://video.liverpoolfc.com/matches"
driver.get(matches_url)
driver.implicitly_wait(10)
latest_game = driver.find_element_by_xpath("/html/body/div[2]/section/ul/li[1]/section/div/div[1]/a").get_attribute('href')
driver.get(latest_game)
driver.implicitly_wait(10)
#--------------Find the full replay video----------------#
#There are many videos to choose from but I only want the full replay of the most recent game.
#--------------------------------------------------#
#institutes a maximum wait time for the page to load; I could have a slow connection one day. 
#wait = WebDriverWait(driver,30)
#wait.until(EC.visibility_of_element_located((By.XPATH,"//a[.//h1[contains(text(),'Full match')]]")))
#finds the full match link using an xpath search term, which is in the brackets
full_replay_xpath_element = driver.find_element_by_xpath("//a[.//h1[contains(text(),'Full match')]]")
#gets the value from the 'href' attribute 
full_match_link = full_replay_xpath_element.get_attribute('href')
#finds the game title so I know what match relates to link I'm getting. 
match_title = driver.find_element_by_xpath("//h1[contains(text(),'Full match')]")
#gets the value using innerText
match_title_innertext = match_title.get_attribute("innerText")
#prints both the game title and the link. 
print(f"{match_title_innertext} : {full_match_link}")
#An example output is: 
#Porto v LFC: Full match : https://video.liverpoolfc.com/player/0_i6064wb1/
