Need help parsing links from an iframe using BeautifulSoup and Python3



I have this URL, and I'm trying to get the video's source link, but it sits inside an iframe. The video URL https://ndisk.cizgifilmlerizle.com... is in an iframe named vjs_iframe. My code is below:

import requests
from bs4 import BeautifulSoup
url = "https://m.wcostream.com/my-hero-academia-season-4-episode-5-english-dubbed"
r = requests.Session() 
headers = {"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0"} # Noticed that website responds better with headers
req = r.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
iframes = soup.find_all("iframe") # Returns an empty list
vjs_iframe = soup.find_all(class_="vjs_iframe") # Also returns an empty list

I don't know how to get the URL inside the iframe, since the iframe's source doesn't even load on the first request. Is it possible to get the https://ndisk.cizgifilmlerizle.com... URL using BeautifulSoup, or do I need to use another library like Selenium or something else? Thanks in advance!
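For what it's worth, the parsing itself is not the hard part: if the iframe were already present in the HTML that `requests` returns, BeautifulSoup could pull its `src` in a couple of lines. A minimal sketch against a made-up snapshot of the rendered DOM (the markup and URL below are hypothetical; the real first response never contains this iframe, which is why the `find_all()` calls come back empty):

```python
from bs4 import BeautifulSoup

# Hypothetical snapshot of the DOM *after* the page's JavaScript has run;
# the raw HTML returned by requests.get() does not contain this iframe.
rendered_html = """
<div class="video-player">
  <iframe class="vjs_iframe" src="https://ndisk.cizgifilmlerizle.com/video/example"></iframe>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
iframe = soup.find("iframe", class_="vjs_iframe")
video_url = iframe["src"]
print(video_url)
```

So the question is really about getting the rendered HTML in the first place, not about BeautifulSoup.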

Here's how I went about scraping their stuff. Idk if you still need this anymore, but I was looking into a problem with that https://ndisk.cizgifilmlerizle.com site and saw this. Figured it might help someone else. It's rough, but it gets the job done.

import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from time import sleep
import os
import string

#   tab 5, space, up arrow 2, space
def do_keys(key, num_times, action_chain):
    for x in range(num_times):
        action_chain.send_keys(key)


def cls():
    print("\033[2J")  # ANSI escape sequence: clear the terminal

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    count = 274
    # Stuck on 274 - 500.  273 also failed.
    attempts = 0
    while count < 501:
        url = f"https://www.wcostream.com/naruto-shippuden-episode-{count}"
        # Keep the trailing separator so the f-string concatenations below work
        video_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "videos") + os.sep
        default_video_name = f"{video_dir}getvid.mp4"
        if not os.path.exists(video_dir):
            os.mkdir(video_dir)
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--mute-audio')
        options.add_experimental_option("prefs", {
            "download.default_directory": video_dir,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        })
        browser = webdriver.Chrome(options=options)
        # browser = webdriver.Chrome()
        browser.get(url)
        sleep(1)
        title_element = None
        try:
            title_element = browser.find_element(
                By.XPATH,
                "//*[@id='content']/table/tbody/tr/td[1]/table/tbody/tr/td/table[1]/tbody/tr[2]/td/div[2]/b/b[1]")
        except Exception as e:
            title_element = browser.find_element(
                By.XPATH,
                "//*[@id='content']/table/tbody/tr/td[1]/table/tbody/tr/td/table[1]/tbody/tr[2]/td/div[2]/b[2]")
        title = title_element.text.lower().translate(str.maketrans('', '', string.punctuation)).replace(' ', '_')
        new_video_name = f"{video_dir}episode_{count}_{title}.mp4"
        cls()
        print(f"Title: {title}")
        # Below is working.
        browser.switch_to.frame(browser.find_element(By.XPATH, "//*[@id='frameNewcizgifilmuploads0']"))
        results = browser.page_source
        soup = BeautifulSoup(results, "html.parser")
        video_url = soup.find("video").get("src")
        print(f"URL:\t{video_url}")
        browser.get(video_url)
        element = browser.find_element(By.TAG_NAME, "video")
        sleep(1)
        actions = ActionChains(browser)
        actions.send_keys(Keys.SPACE)
        actions.perform()
        sleep(1)
        do_keys(Keys.TAB, 5, actions)
        do_keys(Keys.SPACE, 1, actions)
        do_keys(Keys.UP, 2, actions)
        do_keys(Keys.SPACE, 1, actions)
        actions.perform()
        start = time.time()
        print(f"Downloading: {new_video_name}")
        #
        # # browser.get(video_url)
        # print(browser)
        #
        # # print(results)
        # print(f"{video_url}")
        browser_open = True
        timeout = 0
        while browser_open:
            if os.path.isfile(default_video_name):
                if os.path.exists(new_video_name):
                    os.remove(default_video_name)
                    end = time.time()
                    print(f"Already Exists! [{end - start}s]")
                else:
                    os.rename(default_video_name, new_video_name)
                    end = time.time()
                    print(f"Download complete! [{end - start}s]")
                count += 1
                browser_open = False
                browser.close()
            try:
                _ = browser.window_handles
            except Exception as e:
                browser_open = False
            if timeout > 50:
                attempts += 1
                print(f"Download Timed Out.  Trying again. [{attempts}]")
                browser_open = False
                browser.close()
            else:
                attempts = 0
            timeout += 1
            sleep(1)

This site is really tricky: the iframe is generated after the <meta itemprop="embedURL"> tag by semi-obfuscated code (reformatted here for readability):

<script>var nUk = ""; var BBX = ["RnF6Mjk4NzU0MkRXcw==", "Tnl0Mjk4NzU4N3NidA==",
  // TONS OF STRINGS IN THE ARRAY
];
BBX.forEach(function EtT(value) {
  nUk += String.fromCharCode(
    parseInt(atob(value).replace(/\D/g, '')) - 2987482);
});
document.write(decodeURIComponent(escape(nUk)));</script>

The variable names are auto-generated and change whenever you reload the page, but the obfuscation technique stays the same. Each element of the array (value in the forEach loop) encodes one obfuscated character. Here's what happens:

  • it decodes the base64 string (atob)
  • strips every non-digit character (the replace with /\D/g), leaving a number
  • subtracts 2987482 (another auto-generated number that changes on every request)
  • converts the result to a character (the fromCharCode call)
  • concatenates all the characters
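Those five steps are easy to replicate in Python with just the standard library. A sketch, reusing the two real array entries and the offset from the script above (they happen to decode to "<i", the start of the injected iframe tag):

```python
import base64
import re

# Offset copied from the script above; it is regenerated on every page load.
OFFSET = 2987482

def decode_chunk(value: str, offset: int = OFFSET) -> str:
    """Mirror one iteration of the forEach loop:
    atob -> strip non-digits -> subtract the offset -> fromCharCode."""
    digits = re.sub(r"\D", "", base64.b64decode(value).decode())
    return chr(int(digits) - offset)

# The first two entries of the BBX array shown above
print(decode_chunk("RnF6Mjk4NzU0MkRXcw==") + decode_chunk("Tnl0Mjk4NzU4N3NidA=="))

# The full injected markup would be: "".join(decode_chunk(v) for v in BBX)
```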

If you can run that code in a Firefox/Chromium console, leave out the document.write() and print the variable instead: you will see the iframe code that the document.write() call then injects into the page.

You should be able to pass that JavaScript to an interpreter to get the iframe content and capture the URL; then you can scrape that URL.

To scrape this site from Python you'd need some JavaScript interpreter, or to work really hard with regular expressions, e.g. soup.find(string=re.compile(r'.*atob\(')), and then do the same things the JavaScript does in the browser. That is really overkill, and you should only do it for learning purposes. If your task is just to download the iframe content, it is probably easier to find another site.
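To sketch the regex route, here is a made-up miniature of the obfuscated script, reusing the first array value and the offset from above (on the real page the variable names and numbers differ on every load, so the regexes target the structure, not the names):

```python
import re

from bs4 import BeautifulSoup

# Miniature stand-in for the real page's obfuscated <script> block
html = r"""<script>var nUk = ""; var BBX = ["RnF6Mjk4NzU0MkRXcw=="];
BBX.forEach(function EtT(value) {
nUk += String.fromCharCode(parseInt(atob(value).replace(/\D/g,'')) - 2987482);
});
document.write(decodeURIComponent(escape(nUk)));</script>"""

soup = BeautifulSoup(html, "html.parser")
# Find the script that calls atob(), then pull out the base64 array
# entries and the magic offset with regexes
script = soup.find("script", string=re.compile(r"atob\("))
values = re.findall(r'"([A-Za-z0-9+/=]{8,})"', script.string)
offset = int(re.search(r"\)\s*-\s*(\d+)\)", script.string).group(1))
print(values, offset)
```

From there you would apply the decoding steps described earlier to each entry of `values`.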

If you are able to install the library, I suggest using the lxml parser. Also, I really recommend Scrapy; it is a gorgeous piece of software.
