Saving output to a DataFrame with BeautifulSoup



I am new to web scraping. I am trying to scrape data from a news website.

I have this code:

from bs4 import BeautifulSoup as soup
import pandas as pd
import requests
detik_url = "https://news.detik.com/indeks/2"
detik_url
html = requests.get(detik_url)
bsobj = soup(html.content, 'lxml')
bsobj
for link in bsobj.findAll("h3"):
    print("Headline : {}".format(link.text.strip()))
links = []
for news in bsobj.findAll('article',{'class':'list-content__item'}):
    links.append(news.a['href'])
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        print(p.find('p').text.strip())

How can I store the scraped content in a CSV file using a pandas DataFrame?

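You can store the content in a pandas DataFrame and then write that DataFrame to a CSV file.

Suppose you want to save all the text from p.find('p').text.strip() in a CSV file together with a column header. You can keep the header in a variable (for example, head).

So, starting from your code: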

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:                 # <----- Here we make the changes
        print(p.find('p').text.strip())

At the marked line, we make the following change:

import pandas as pd
head = 'text'  # column header for the CSV; the name is only an example
# Create an empty list to store all the data
generated_text = []
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        # add a print statement here if you want to see the output
        generated_text.append(p.find('p').text.strip())  # <---- save the data in the list

# then write this into a csv file using pandas; first you need to create a
# dataframe from the list
df = pd.DataFrame(generated_text, columns = [head])
# save this into a csv file
df.to_csv('csv_name.csv', index = False)
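
If you also want to know which page each row came from, one option is to collect a small dict per page instead of a bare string. The following is only a sketch under a few assumptions: the helper names `first_paragraph` and `build_frame`, the `url`/`text` column names, and the sample HTML are all made up for illustration, and the inline HTML strings stand in for what `requests.get(link).content` would return. It also returns None when a page has no matching div or paragraph, rather than raising an AttributeError.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Class string taken from the question's code
BODY_CLASS = "detail__body itp_bodycontent_wrapper"


def first_paragraph(html):
    """Return the stripped text of the first <p> in the article body, or None."""
    page = BeautifulSoup(html, "html.parser")
    body = page.find("div", {"class": BODY_CLASS})
    if body is None:
        return None
    p = body.find("p")
    return p.text.strip() if p is not None else None


def build_frame(pages):
    """Build a DataFrame from a dict mapping URL -> HTML (hypothetical input shape)."""
    rows = [{"url": url, "text": first_paragraph(html)} for url, html in pages.items()]
    return pd.DataFrame(rows, columns=["url", "text"])


# Made-up pages standing in for requests.get(link).content
sample_pages = {
    "https://example.com/a": (
        '<div class="detail__body itp_bodycontent_wrapper">'
        "<p> First story. </p><p>More.</p></div>"
    ),
    "https://example.com/b": "<div class='other'><p>No article body here.</p></div>",
}

df = build_frame(sample_pages)
df.to_csv("csv_name.csv", index=False)
```

Pages without a matching body end up as an empty cell in the CSV, so you can spot and re-check them later.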

Alternatively, you can use a list comprehension instead of the for loop and save the result straight to your CSV.


# Instead of the snippet above, replace the whole `for p in div` loop.
# So, from your code above:
.....
bsobj = soup(page.content)
div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
# Remove the whole `for p in div:` loop and use this instead
# (head is the column header variable mentioned earlier):
df = pd.DataFrame([p.find('p').text.strip() for p in div], columns = [head])
....
df.to_csv('csv_name.csv', index = False)
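
One thing to watch: if that comprehension stays inside the `for link in links` loop, a new DataFrame is built on every pass and only the last page survives. A possible way around this is a single nested comprehension over all pages. This is a sketch, not the only way to do it: the `html_pages` list holds made-up HTML in place of the fetched pages, `head = "text"` is an example column name, and the `:=` operator needs Python 3.8 or later.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Class string taken from the question's code
BODY_CLASS = "detail__body itp_bodycontent_wrapper"

# Made-up HTML standing in for the pages fetched with requests.get(link)
html_pages = [
    '<div class="detail__body itp_bodycontent_wrapper"><p>Story one.</p></div>',
    '<div class="detail__body itp_bodycontent_wrapper"><span>No paragraph.</span></div>',
    '<div class="detail__body itp_bodycontent_wrapper"><p> Story two. </p></div>',
]

# One flat list across all pages; divs without a <p> are skipped
texts = [
    p.text.strip()
    for html in html_pages
    for div in BeautifulSoup(html, "html.parser").find_all("div", {"class": BODY_CLASS})
    if (p := div.find("p")) is not None
]

head = "text"  # example column header
df = pd.DataFrame(texts, columns=[head])
df.to_csv("csv_name.csv", index=False)
```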

You can also convert the list produced by the list comprehension to a NumPy array and write it directly to a CSV file:

import numpy as np
import pandas as pd
# On a side note:
# convert your plain list to a numpy array, or build the numpy array from a list comprehension;
# there are also faster ways to convert a plain list to a numpy array which you can explore.
# From there you can write to a csv.
nparray = np.array(generated_text)  # generated_text is the list built earlier
pd.DataFrame(nparray).to_csv('csv_name.csv')
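
To check that the text survives the trip through CSV, you can write to a string and read it back. This is a small sketch with made-up sample strings; the column name `text` is an assumption. Passing `columns=` gives the CSV a readable header (without it the column is simply named 0), and `index=False` leaves out the row numbers.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample data standing in for the scraped paragraphs
generated_text = ["First story.", "Second story, with a comma."]
nparray = np.array(generated_text)

# to_csv returns the CSV as a string when no path is given
csv_text = pd.DataFrame(nparray, columns=["text"]).to_csv(index=False)

# Read the CSV text back to confirm the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Fields that contain commas are quoted automatically by `to_csv`, so they come back as a single value.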
