"Your are using old unsupported Internet Explorer browser" when web scraping a website



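I'm trying to scrape the following website (https://iltacon2022.expofp.com/) and I keep getting the following error (the full output is printed below). I'm not sure what the problem is, and I was wondering if anyone could help.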

if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}

I have tried using both the selenium and requests modules, but I seem to run into the same problem either way.

Code trials:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import random
import time
import requests
options = Options()
options.headless = False
driver = webdriver.Firefox(options=options)
url = "https://iltacon2022.expofp.com/"
driver.get(url)
time.sleep(6)
soup = bs(driver.page_source, 'lxml')
driver.quit()
print(soup)

Output:

<html lang="en"><head>
<meta charset="utf-8"/>
<link href="https://iltacon2022.expofp.com/packages/master/favicon.png" rel="shortcut icon"/>
<meta content="user-scalable=no, initial-scale=1.0, maximum-scale=1.0, width=device-width" name="viewport"/>
<!-- <meta name="theme-color" content="#000000" /> -->
<title>ILTACON2022 – Gaylord National Resort and Convention Center | August 22–25, 2022 | Monday – Thursday – Expo Floor Plan by ExpoFP</title>
<script>
if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}
</script>
<style>
html,
body {
touch-action: none;
margin: 0;
padding: 0;
height: 100%;
width: 100%;
background: #ebebeb;
position: fixed;
overflow: hidden;
}
@media (max-width: 820px) and (min-width: 500px) {
html {
font-size: 13px;
}
}
</style>
<style>
.lds-grid {
top: 42vh;
margin: 0 auto;
display: block;
position: relative;
width: 64px;
height: 64px;
}
.lds-grid div {
position: absolute;
width: 13px;
height: 13px;
background: #aaa;
border-radius: 50%;
/* border: solid 1px #fff; */
animation: lds-grid 1.2s linear infinite;
}
.lds-grid div:nth-child(1) {
top: 6px;
left: 6px;
animation-delay: 0s;
}
.lds-grid div:nth-child(2) {
top: 6px;
left: 26px;
animation-delay: -0.4s;
}
.lds-grid div:nth-child(3) {
top: 6px;
left: 45px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(4) {
top: 26px;
left: 6px;
animation-delay: -0.4s;
}
.lds-grid div:nth-child(5) {
top: 26px;
left: 26px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(6) {
top: 26px;
left: 45px;
animation-delay: -1.2s;
}
.lds-grid div:nth-child(7) {
top: 45px;
left: 6px;
animation-delay: -0.8s;
}
.lds-grid div:nth-child(8) {
top: 45px;
left: 26px;
animation-delay: -1.2s;
}
.lds-grid div:nth-child(9) {
top: 45px;
left: 45px;
animation-delay: -1.6s;
}
@keyframes lds-grid {
0%,
100% {
opacity: 1;
}
50% {
opacity: 0.5;
}
}
</style>
<link as="script" href="https://iltacon2022.expofp.com/data/data.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/data/fp.svg.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/floorplan.js" rel="preload"/>
<link as="script" href="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/css/fontawesome-all.min.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/sanitize-css/sanitize.css" rel="preload"/>
<link as="style" href="https://iltacon2022.expofp.com/packages/master/vendor/perfect-scrollbar/css/perfect-scrollbar.css" rel="preload"/>
<!-- Fonts are anonymous because those will be loaded with FontFace -->
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-regular-400.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-solid-900.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/vendor/fa/webfonts/fa-light-300.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-500.woff2" rel="preload"/>
<link as="font" crossorigin="anonymous" href="https://iltacon2022.expofp.com/packages/master/fonts/oswald-v17-cyrillic_latin-300.woff2" rel="preload"/>
<script src="https://iltacon2022.expofp.com/data/data.js"></script><script src="https://iltacon2022.expofp.com/data/wf.data.js"></script><script src="https://iltacon2022.expofp.com/data/fp.svg.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/vendors~floorplan.js"></script><script charset="utf-8" src="https://iltacon2022.expofp.com/packages/master/floorplan.js"></script></head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div class="expofp-floorplan" data-event-id="iltacon2022"><div></div></div>
<script src="https://iltacon2022.expofp.com/packages/master/expofp.js"></script>
</body></html>

Your task is not trivial. Here is one possible solution:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
actions = ActionChains(browser)
url = 'https://iltacon2022.expofp.com/'
browser.get(url) 
c_list = []
parent_el = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@data-event-id="iltacon2022"]/div')))
parent_el_shadow_root = parent_el.shadow_root 
t.sleep(5)
companies_div = parent_el_shadow_root.find_element(By.CSS_SELECTOR, 'div[class="overlay-content__scrollable ps ps--active-y"]')
while True:
    try:
        companies = parent_el_shadow_root.find_elements(By.CSS_SELECTOR, "a[class = 'exhibitor-row list-row  ']")
        for c in companies:
            if len(c.text) > 3:
                c_list.append((c.text.replace('\n', ': '), c.get_attribute('href')))
        print(f'we found {len(c_list)} companies')
        # Indexing past the end of the list raises and ends the loop
        actions.move_to_element(companies[len(c_list)]).perform()
        print("moving to element", companies[len(c_list)].text.replace('\n', ': '))
        t.sleep(1)
        companies[len(c_list)].send_keys(Keys.PAGE_DOWN)
        print('scrolled page down')
        t.sleep(2)
    except Exception as e:
        print('all done')
        break
df = pd.DataFrame(list(set(c_list)), columns = ['Company', 'Url'])
df.to_csv('surveillance_capitalists.csv')
print(df)

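Because of where the shadow root sits in the code above, it is important to use Chrome/chromedriver. The setup above is for Linux, but you can create a working selenium/chromedriver setup on your own machine; then you only need to mind the imports and the code after the browser/driver is defined. The printout in the terminal is fairly verbose and tells you what is happening; at the end it prints a dataframe with the companies and their respective URLs (also saved to disk as a CSV file). You can then scrape those URLs, just make sure you inspect each page properly and locate the shadow root and the elements within it. Selenium documentation can be found at https://www.selenium.dev/documentation/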
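The scroll loop above can collect the same row more than once, and `list(set(c_list))` removes the duplicates but loses the on-page order. As a minimal sketch of an alternative, the helper below deduplicates while keeping first-seen order. The name `clean_rows` and the sample data are hypothetical, invented for illustration; they are not part of the answer's code or of the site's real content.

```python
def clean_rows(rows):
    # Normalise (text, href) pairs and drop duplicates, keeping first-seen order.
    seen = set()
    cleaned = []
    for text, href in rows:
        # Row text carries a line break between fields; join the parts with ': '.
        parts = [part.strip() for part in text.split("\n") if part.strip()]
        label = ": ".join(parts)
        key = (label, href)
        # Same length filter as the loop above: skip near-empty rows.
        if len(label) > 3 and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Hypothetical sample data standing in for what the scroll loop collects.
sample = [
    ("Acme Legal\nBooth 101", "https://example.com/acme"),
    ("Acme Legal\nBooth 101", "https://example.com/acme"),
    ("Beta Docs\nBooth 202", "https://example.com/beta"),
]
rows = clean_rows(sample)
print(rows)
```

The result could then be passed to `pd.DataFrame(rows, columns=['Company', 'Url'])` in place of `list(set(c_list))`.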

For any issues, just comment here, or ask in the Selenium chat room, which I find very helpful.

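The AUT (Application Under Test) sometimes tries to detect, using jQuery, whether Internet Explorer is the browser being used to access the application.

According to the discussion, jQuery failed to detect IE 11: internet-explorer-10 was detected correctly, but internet-explorer-11 was not, because it uses a different user agent: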

Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

The proposed beta solution was:

if (!!navigator.userAgent.match(/Trident\/7\./))
    return "ie";

That does not seem to have gone through. However, a modified solution was implemented:

<script>
if (window.navigator.userAgent.indexOf("Trident/") !== -1) {
alert("Your are using old unsupported Internet Explorer browser.\nPlease upgrade to view this page properly.");
}
</script>

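This is the check you observed in the <script> tag. It means that if the user agent contains the string Trident/, you are using Internet Explorer (IE 11 included), and the page alerts you to upgrade your browser. A browser whose user agent lacks that string never triggers the alert.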
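To make the two checks concrete, here is a small Python sketch that mirrors them against sample user agent strings. The function names and the Firefox and IE 10 strings are illustrative assumptions, not taken from the site; only the `Trident/` substring test and the `Trident/7.` pattern come from the snippets above.

```python
import re

# IE 11 user agent, as quoted in the discussion.
IE11_UA = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
# Assumed examples of an IE 10 and a Firefox user agent, for comparison.
IE10_UA = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)"
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def page_would_alert(user_agent):
    # Mirrors the page's check: userAgent.indexOf("Trident/") !== -1
    return user_agent.find("Trident/") != -1


def matches_ie11_pattern(user_agent):
    # Mirrors the proposed check: a match for "Trident/7." in the user agent
    return re.search(r"Trident/7\.", user_agent) is not None


for name, ua in [("IE 11", IE11_UA), ("IE 10", IE10_UA), ("Firefox", FIREFOX_UA)]:
    print(name, page_would_alert(ua), matches_ie11_pattern(ua))
```

The substring test is the broader of the two: it flags every Trident-based Internet Explorer, while the pattern only matches Trident 7, which is the engine version IE 11 reports.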


Conclusion

If you use the Internet Explorer browser you may observe the effect of this check; otherwise you can safely ignore it, as it does not affect your tests.
