I'm new to web scraping. I'm trying to scrape data from a news site.
I have this code:
from bs4 import BeautifulSoup as soup
import pandas as pd
import requests

detik_url = "https://news.detik.com/indeks/2"
html = requests.get(detik_url)
bsobj = soup(html.content, 'lxml')

# Print every headline on the index page
for link in bsobj.findAll("h3"):
    print("Headline : {}".format(link.text.strip()))

# Collect the article URLs
links = []
for news in bsobj.findAll('article',{'class':'list-content__item'}):
    links.append(news.a['href'])

# Visit each article and print the first paragraph of its body
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        print(p.find('p').text.strip())
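How can I store the scraped content in a CSV file using a Pandas DataFrame?
You can store the content in a pandas DataFrame and then write that DataFrame to a CSV file.
Suppose you want to save all the text from p.find('p').text.strip() to a CSV file together with a column header. You can keep the header in any variable (for example, head).
So, starting from your code: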
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div: # <----- Here we make the changes
        print(p.find('p').text.strip())
At the marked line, we make the following change:
import pandas as pd

head = 'text'  # placeholder column header (an assumption); use any name you like

# Create an empty list to collect the text from every article
generated_text = []
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')  # name the parser explicitly, as in the index-page request
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        # add a print statement here if you want to see the output
        generated_text.append(p.find('p').text.strip())  # <---- save the text in the list

# After the loop has finished, create a DataFrame from the list
df = pd.DataFrame(generated_text, columns = [head])
# and save it to a CSV file
df.to_csv('csv_name.csv', index = False)
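As a minimal sketch of the collect-then-write pattern above, the parsing step can be pulled into a small function that works on an HTML string, so it can be tried without hitting the network. The names `first_paragraphs` and `texts_to_csv` and the default header `"text"` are hypothetical, the built-in `html.parser` is swapped in for `lxml`, and the class string is copied from the question. Divs that contain no `<p>` are skipped here instead of raising an error.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Class attribute of the article-body div, copied from the question.
BODY_CLASS = "detail__body itp_bodycontent_wrapper"


def first_paragraphs(html):
    """Return the stripped text of the first <p> in each article-body div."""
    page = BeautifulSoup(html, "html.parser")
    texts = []
    for div in page.find_all("div", {"class": BODY_CLASS}):
        p = div.find("p")
        if p is not None:  # a div without <p> would otherwise raise AttributeError
            texts.append(p.text.strip())
    return texts


def texts_to_csv(texts, path_or_buf, head="text"):
    """Write one text per row under a single column named `head`."""
    df = pd.DataFrame(texts, columns=[head])
    df.to_csv(path_or_buf, index=False)
    return df


# Usage with a hand-written HTML fragment instead of a live page.
sample = (
    '<div class="detail__body itp_bodycontent_wrapper">'
    "<p> First paragraph. </p><p>Second paragraph.</p></div>"
)
buffer = io.StringIO()
texts_to_csv(first_paragraphs(sample), buffer)
print(buffer.getvalue())
```

In the scraping loop you would presumably call `generated_text.extend(first_paragraphs(page.content))` for each link, then pass a filename such as 'csv_name.csv' instead of the in-memory buffer.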
Alternatively, you can use a list comprehension instead of the for loop and save the result to your CSV.
# Instead of the snippet above, replace the whole `for p in div` loop.
# So, from your code above:
.....
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    # Remove the whole `for p in div:` loop and use this instead.
    # Note: inside the `for link in links` loop, df holds only the current page's text.
    df = pd.DataFrame([p.find('p').text.strip() for p in div], columns = [head])
....
df.to_csv('csv_name.csv', index = False)
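One way to use a comprehension without rebuilding the DataFrame for every link is to run it over all pages at once. The sketch below assumes the pages have already been downloaded into a `{url: html}` dictionary. The function name `frame_from_pages`, the column names `url` and `text`, and the example.com URLs are all hypothetical, and the built-in `html.parser` is swapped in for `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Class attribute of the article-body div, copied from the question.
BODY_CLASS = "detail__body itp_bodycontent_wrapper"


def frame_from_pages(pages):
    """Build one DataFrame from a {url: html} mapping, one row per body div."""
    rows = [
        {"url": url, "text": p.text.strip()}
        for url, html in pages.items()
        for div in BeautifulSoup(html, "html.parser").find_all("div", {"class": BODY_CLASS})
        for p in [div.find("p")]  # bind the first <p> once so it can be tested
        if p is not None          # skip divs that contain no paragraph
    ]
    return pd.DataFrame(rows, columns=["url", "text"])


# Usage with made-up pages standing in for the downloaded articles.
pages = {
    "https://example.com/a": '<div class="detail__body itp_bodycontent_wrapper"><p>Alpha</p></div>',
    "https://example.com/b": '<div class="detail__body itp_bodycontent_wrapper"><p> Beta </p></div>',
}
df = frame_from_pages(pages)
print(df)
```

Keeping the URL next to each text makes the CSV easier to trace back to its source. `df.to_csv('csv_name.csv', index=False)` then writes all pages in one call.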
You can also convert the list produced by the list comprehension into a numpy array and write it directly to a CSV file:
import numpy as np
import pandas as pd

# On a side note:
# convert your plain list to a numpy array (or build the array from a list comprehension);
# there are also faster ways to do this conversion, which you can explore.
# From there you can write to a CSV file.
nparray = np.array(generated_text)  # generated_text is the list built earlier
pd.DataFrame(nparray).to_csv('csv_name.csv')
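To illustrate the numpy route, here is a small round trip under stated assumptions: the sample strings and the column name `text` are made up, and an in-memory buffer stands in for 'csv_name.csv'. Without `columns` and `index=False`, pandas would write a row-index column and a header named `0`. The sketch also shows that pandas quotes a field containing a comma, so scraped sentences survive being read back.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample data; in the scraper this would be the collected text.
texts = ["Hello, world", "Second row"]
nparray = np.array(texts)  # 1-D array of strings

# Write the array as a single named column, without the row index.
buffer = io.StringIO()
pd.DataFrame(nparray, columns=["text"]).to_csv(buffer, index=False)

# Read the CSV back to confirm the text is unchanged.
buffer.seek(0)
restored = pd.read_csv(buffer)["text"].tolist()
print(restored)
```

For a flat list of strings the numpy step is optional, since `pd.DataFrame` accepts the list directly.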