Remove header and footer sections from scraped data

I want to remove the header and footer sections from scraped data, if they are present.

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
    driver.quit()
except WebDriverException:
    driver.quit()
soup = BeautifulSoup(html_content)
text = soup.getText(separator=u' ')

I tried removing the tags, but it did not work. How can I do this?

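Option 1:

Find the <header> and <footer> elements and call .extract() on each of them.

Option 2:

The <main> tag sits right between the <header> and <footer> tags. If that part is all you want, you can select it directly: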

main = soup.find('main')
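
To illustrate Option 2 without a browser, here is a minimal sketch that runs on a small inline document. SAMPLE_HTML is a hypothetical stand-in for the real page source, which is only assumed to have the same <header> / <main> / <footer> layout.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real page is assumed
# to wrap its content in <header>, <main> and <footer> the same way.
SAMPLE_HTML = """
<html><body>
<header><nav>Site menu</nav></header>
<main><h1>Event title</h1><p>Event details</p></main>
<footer>Contact info</footer>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
main = soup.find("main")

# find() returns None when there is no <main> tag, so fall back to the
# whole document instead of failing with an AttributeError.
target = main if main is not None else soup
text = target.get_text(separator=" ", strip=True)
print(text)
```

Because only the <main> subtree is read, nothing has to be deleted from the tree and the original soup stays intact.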

Also, why are you using Selenium? Wouldn't plain requests do the trick?

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"

driver = None
html_content = ""
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.implicitly_wait(2)
    driver.get(URL)
    html_content = driver.page_source
except WebDriverException as exc:
    print(f"Could not load the page: {exc}")
finally:
    # Quit only if the browser actually started.
    if driver is not None:
        driver.quit()

soup = BeautifulSoup(html_content, "html.parser")

# Remove the header and footer before extracting the text.
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:  # find() returns None when the tag is absent
        s.extract()

text = soup.get_text(separator=' ')
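
On the Selenium question: if the page turns out to be rendered on the server, a requests-based version might look roughly like the sketch below. I have not checked whether this particular page needs JavaScript, so treat that as an assumption. The helper names strip_header_footer and fetch_text, and the 10-second timeout, are made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_header_footer(html):
    """Return the page text with every <header> and <footer> removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names; extract() detaches each match.
    for tag in soup.find_all(["header", "footer"]):
        tag.extract()
    return soup.get_text(separator=" ", strip=True)


def fetch_text(url, timeout=10):
    """Download url with requests and return its text without header/footer."""
    import requests  # imported here so strip_header_footer works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return strip_header_footer(response.text)
```

Calling fetch_text(URL) with the URL from the question would then replace the whole Selenium block. If the returned text is missing the event details, the content is probably injected by JavaScript and Selenium is still needed.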
