Remove the header and footer sections from scraped data



I want to remove the header and footer sections, if present, from scraped data.

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"
try:
    driver = webdriver.Chrome(service = service, options = options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
    driver.quit()
except WebDriverException:
    driver.quit()
soup = BeautifulSoup(html_content)
text = soup.getText(separator=u' ')

I tried removing the tags, but it does not work. How can I achieve this?

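Option 1:

Find just the <header> and <footer> elements and remove them from the tree with .extract().

Option 2:

The <main> tag sits right between the <header> and <footer> tags. If that is the only part you want, you can simply select it: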

main = soup.find('main')
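
To make Option 2 concrete, here is a minimal sketch that runs without a browser. The inline HTML is a hypothetical stand-in for the scraped page (the real page's markup may differ), and the fallback to the whole document when no <main> is found is my own assumption, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real markup may differ.
html_content = """
<html><body>
<header><nav>Home | About</nav></header>
<main><h1>Concert</h1><p>Friday at 7:30 pm</p></main>
<footer>Contact us</footer>
</body></html>
"""

soup = BeautifulSoup(html_content, "html.parser")
main = soup.find("main")

# find() returns None when there is no <main>, so fall back to the whole document.
target = main if main is not None else soup
text = target.get_text(separator=" ", strip=True)
print(text)
```

Because only the contents of <main> are read, the header and footer never need to be removed at all.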

Also, why are you using Selenium? Wouldn't plain requests do the trick?

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
URL = "https://www.uh.edu/kgmca/music/events/calendar/?view=e&id=30723#event"

driver = webdriver.Chrome(service=service, options=options)
try:
    driver.implicitly_wait(2)
    driver.get(URL)
    html_content = driver.page_source
finally:
    # Always close the browser, even if loading the page fails.
    driver.quit()

soup = BeautifulSoup(html_content, "html.parser")

# Remove the header and footer first, then extract the text.
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:  # find() returns None when the tag is missing
        s.extract()

text = soup.get_text(separator=' ')
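
On the question of whether requests alone is enough: the sketch below shows what that could look like, assuming the page is served as static HTML and does not need JavaScript to render (I have not verified this for the URL above). The function names `strip_header_footer` and `fetch_text` are hypothetical helpers made up for illustration, and `decompose()` is used in place of `.extract()` since the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup


def strip_header_footer(html):
    """Return the page text with every <header> and <footer> removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["header", "footer"]):
        tag.decompose()  # drop the tag and its contents from the tree
    return soup.get_text(separator=" ", strip=True)


def fetch_text(url):
    """Download a page with requests and return its text without header/footer."""
    import requests  # imported here so strip_header_footer works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return strip_header_footer(response.text)


# Offline check on a hypothetical snippet; no network access is needed.
sample = "<header>Menu</header><main><p>Event details</p></main><footer>Contact</footer>"
print(strip_header_footer(sample))
```

If the text comes back empty or incomplete for the real page, the content is probably rendered by JavaScript, and Selenium is still the right tool.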
