Capturing information from the console with Python



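I'm writing a script that scrapes m4a files from a website. At the moment I'm using BS4 and Selenium.

I'm having trouble getting at the information. The file link is not in the page's HTML source; I can only find it in the browser console. The link I'm after is shown in this screenshot (https://i.stack.imgur.com/5rUJH.jpg), labeled "audio_url_m4a:".

Here is some sample code I'm using: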

import time

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'browser': 'ALL'}  # ask Chrome to keep every browser console message
driver = webdriver.Chrome(r'chromedriver path', desired_capabilities=d)
# ... lots of code doing other things not relevant to the post ...
for URL in audm_URL:  # audm_URL is a list of URLs constructed earlier in the script
    driver.get(URL)
    time.sleep(3)  # give the page time to load and log
    for entry in driver.get_log('browser'):
        print(entry)

Here is the output I get:


{'level': 'SEVERE', 'message': 'https://audm.herokuapp.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1611291689357}
{'level': 'SEVERE', 'message': 'https://cdn.segment.com/analytics.js/v1/5DOhLj2nIgYtQeSfn9YF5gpAiPqRtWSc/analytics.min.js - Failed to load resource: net::ERR_NAME_NOT_RESOLVED', 'source': 'network', 'timestamp': 1611291689357}

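Most questions about scraping things from the console point to grabbing the logs, but nothing seems to explain how to grab other variables. Any ideas?

Here is a link to a random audio page I want to grab the file from: https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c

Thanks, everyone!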

driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")
# Wait for a button to appear, then click it
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "button"))).click()
# Wait for the player's video element and read its src attribute
src = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".react-player video"))).get_attribute("src")

print(src)

If you only want the src, you can use the code above.

You need these imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

If you want to get it from the console log instead, use the following. It seems to work only in headless mode; I'm looking into why:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
capabilities = webdriver.DesiredCapabilities().CHROME.copy()
capabilities['loggingPrefs'] = {'browser': 'ALL'}  # collect all browser console messages
driver = webdriver.Chrome(options=options, desired_capabilities=capabilities)
driver.maximize_window()

time.sleep(3)
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")

for entry in driver.get_log('browser'):
    print(entry)

In headless mode w3c is false, which is why it works there.

For non-headless mode you have to set it explicitly:

options.add_experimental_option('w3c', False)
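
Once `driver.get_log('browser')` returns the console entries, the m4a link still has to be picked out of the message text. Below is a minimal sketch of that step. It assumes the URL shows up as plain text inside each entry's 'message' string, which may not hold if the page logs an object rather than a string. The helper name `extract_m4a_urls`, the regular expression, and the sample entries are all hypothetical and not taken from the site.

```python
import re

# Assumption: the .m4a URL appears verbatim in the log entry's 'message' text.
# The pattern stops at whitespace, quotes, and backslashes so that it does not
# swallow the quoting Chrome puts around logged strings.
M4A_URL = re.compile(r"""https?://[^\s"'\\]+?\.m4a\b(?:\?[^\s"'\\]*)?""")


def extract_m4a_urls(entries):
    """Return the unique .m4a URLs found in log entries, in first-seen order."""
    found = []
    for entry in entries:
        for url in M4A_URL.findall(entry.get("message", "")):
            if url not in found:
                found.append(url)
    return found


# Hypothetical entries shaped like the dicts returned by driver.get_log('browser')
sample_entries = [
    {"level": "SEVERE",
     "message": "https://audm.herokuapp.com/favicon.ico - Failed to load resource"},
    {"level": "INFO",
     "message": 'https://example.invalid/static/js/main.js 1:100 '
                '"audio_url_m4a: https://example.invalid/audio/episode.m4a"'},
]

print(extract_m4a_urls(sample_entries))
```

In the real script you would pass `driver.get_log('browser')` in place of `sample_entries`. If nothing comes back, the link is probably not being logged as text, and reading the `src` attribute as shown earlier is the more dependable route.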

That did the trick. I was looking at it the wrong way and hadn't tried getting the src. Thanks for the advice!
