How to scrape content without the full cookie details / how to bypass them



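I am using Selenium and BeautifulSoup to scrape content from Indeed, but the output also includes all of the cookie details. How can I skip the cookie information and get only the content displayed on the page?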

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://uk.indeed.com/viewjob?jk=b4fea8232173a7ae&tk=1gf89ph6gg3mi801&from=mobhp_jobfeed'
driver = webdriver.Chrome(ChromeDriverManager().install())
# driver = webdriver.Chrome(executable_path=DRIVER_PATH)
wait = WebDriverWait(driver, 20)
driver.get(url)
raw_text = BeautifulSoup(driver.page_source, "lxml").get_text(strip=True, separator=". ")
print(raw_text)

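The question needs some improvement to make it clearer. Assuming you only want specific information, try selecting a more specific element: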

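# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(driver.page_source, "lxml")
# Restrict the text extraction to the job container only
print(soup.select_one('.jobsearch-JobComponent').get_text('.', strip=True))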
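To show why scoping the selection helps, here is a minimal sketch that runs without a browser. The HTML string is a made-up, simplified stand-in for `driver.page_source`, the banner id `onetrust-banner-sdk` is an assumption, and `extract_job_text` is a hypothetical helper name; only the `.jobsearch-JobComponent` selector comes from the answer above. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for driver.page_source.
SAMPLE_HTML = """
<html><body>
  <div id="onetrust-banner-sdk"><p>We use cookies</p><button>Reject all</button></div>
  <div class="jobsearch-JobComponent">
    <h1>Data Analyst</h1>
    <p>Full-time role in London</p>
  </div>
</body></html>
"""


def extract_job_text(html, selector=".jobsearch-JobComponent"):
    """Return the text of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        # The selector did not match, e.g. because the page layout changed.
        return None
    return node.get_text(". ", strip=True)


job_text = extract_job_text(SAMPLE_HTML)
print(job_text)
```

Because only the matched element is traversed, the banner text never reaches the output. Checking for `None` avoids an `AttributeError` if the class name on the real page differs.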

Alternatively, to get rid of the cookie banner information, simply click it away:

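# Wait until the reject button is clickable (reuses `wait` and `EC` from the question's code), then dismiss the banner
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#onetrust-reject-all-handler'))).click()
# Parse the page again now that the banner is gone
soup = BeautifulSoup(driver.page_source, "lxml")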
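If clicking is not practical (for instance, the button never becomes clickable), a different technique from the one above is to drop the banner element from the parsed tree before extracting text. This is only a sketch: the banner selector `#onetrust-banner-sdk` is an assumption about the page's markup, `text_without_banner` is a hypothetical helper name, and the sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup


def text_without_banner(html, banner_selector="#onetrust-banner-sdk"):
    """Remove every element matching banner_selector, then return the page text."""
    soup = BeautifulSoup(html, "html.parser")
    for banner in soup.select(banner_selector):
        # decompose() deletes the tag and all of its children from the tree.
        banner.decompose()
    return soup.get_text(". ", strip=True)


# Hypothetical markup standing in for driver.page_source.
SAMPLE_HTML = (
    '<div id="onetrust-banner-sdk"><p>We use cookies</p></div>'
    '<div class="jobsearch-JobComponent"><h1>Data Analyst</h1></div>'
)

cleaned = text_without_banner(SAMPLE_HTML)
print(cleaned)
```

In the Selenium script this would be called as `text_without_banner(driver.page_source)`. Unlike clicking, it leaves the live page untouched and only changes the parsed copy.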
