Storing HTML data as CSV and processing it with pandas

I am scraping URLs and storing the full HTML of each page in a pandas DataFrame, so that it can be saved as a CSV file and cleaned.

Code 1

import re

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url_list = ["https://www.flagstaffsymphony.org/event/masterworks-v-saint-saens-and-bruckner/",
            "https://www.berlinerfestspiele.de/de/berliner-festspiele/programm/bfs-gesamtprogramm/programmdetail_341787.html",
            "https://www.seattlesymphony.org/en/concerttickets/calendar/2021-2022/21bar3"]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}

driver = webdriver.Chrome('/home/ubuntu/selenium_drivers/chromedriver')
url_data = []
for URL in url_list:
    driver.get(URL)
    driver.implicitly_wait(2)
    data = driver.page_source  # full HTML of the rendered page
    row_data = [URL, data]
    url_data.append(row_data)

html_data = pd.DataFrame(url_data, columns=['urllist', 'data'])
html_data["parsedata"] = BeautifulSoup(str(html_data["data"]), "lxml").text
cleanr = re.compile('<.*?>')
html_data["cleandata"] = re.sub(cleanr, '', str(html_data["parsedata"]))

But after cleaning, html_data["cleandata"] contains garbage values instead of the cleaned data. When I process a single URL on its own, the cleaning works. How can I clean HTML data that is stored in a pandas DataFrame?

The BeautifulSoup parser works on text strings, but BeautifulSoup(str(html_data["data"]), ...) passes the string representation of an entire pandas Series. The fix is simply to apply a function row by row, parsing and cleaning each page's text individually:

html_data = pd.DataFrame(url_data, columns=['urllist', 'data'])
# Parse each row's HTML individually instead of the whole Series at once
html_data["parsedata"] = html_data.data.apply(lambda x: BeautifulSoup(x, "lxml").text)
cleanr = re.compile('<.*?>')
html_data["cleandata"] = html_data.parsedata.apply(lambda x: re.sub(cleanr, '', x))

In addition, I would suggest parsing and cleaning each page before appending it to url_data, i.e. before building the DataFrame html_data:

cleanr = re.compile('<.*?>')
url_data = []
for URL in url_list:
    driver.get(URL)
    driver.implicitly_wait(2)
    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")
    # Strip any leftover tags from the extracted text before storing it
    cleaned_data = re.sub(cleanr, '', soup.text)
    url_data.append([URL, cleaned_data])

html_data = pd.DataFrame(url_data, columns=['urllist', 'cleandata'])
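
Finally, since the stated goal is to store the result as a CSV file, the cleaned DataFrame can be written out with pandas' to_csv. A minimal sketch (the filename html_data.csv is an assumption):

# Persist the cleaned data; index=False leaves out the DataFrame's row index
html_data.to_csv("html_data.csv", index=False)
driver.quit()  # close the browser once scraping is finished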