I am scraping multiple pages of search results from this website into a neatly formatted pandas DataFrame.
I've outlined the steps for how to get this done.
1) Identify the information I want to pull from each result (3 things)
2) Put all the information for those 3 things into separate lists
3) Append the items in the lists to a pandas DataFrame via a for loop
Here is what I have tried so far:
import requests
import pandas as pd
#!pip install bs4
from bs4 import BeautifulSoup as bs
url = 'https://www.federalregister.gov/documents/search?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021&conditions%5Bterm%5D=economy'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
soup = bs(result.text, 'html.parser')
titles = soup.find_all('h5')
authors = soup.find_all('p')
#dates = soup.find_all('')
#append in for loop
data=[]
for i in range(2,22):
    data.append(titles[i].text)
    data.append(authors[i].text)
    #data.append(dates[i].text)
data=pd.DataFrame()
I can see the results before converting them to a pandas DataFrame, but the last line actually wipes them out.
Also, I'm not quite sure how to loop through multiple pages of search results. I found some code that lets you pick a start and end page to iterate over, like this:
URL = ['https://www.federalregister.gov/documents/search?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021&conditions%5Bterm%5D=economy&page=2',
'https://www.federalregister.gov/documents/search?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021&conditions%5Bterm%5D=economy&page=4']
for url in range(0,2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('h5')
    print(titles)
The problem I run into with this approach is that the first page isn't formatted like all the others. From the second page onward, the URL ends in "&page=2". I'm not sure how to account for that.
To sum up, the end result I want is a DataFrame that looks like this:
Title Author Date
Blah1 Agency1 09/23/2020
Blah2 Agency2 08/22/2018
Blah3 Agency3 06/02/2017
....
Can anyone point me in the right direction? I'm pretty lost on this.
I don't think you need to parse all the pages; you can just download the CSV file.
import pandas as pd
import requests
import io
url = 'https://www.federalregister.gov/documents/search?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021&conditions%5Bterm%5D=economy'
url += '&format=csv' # <- Download as CSV
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
df = pd.read_csv(io.StringIO(result.text))
Output:
>>> df
title type ... pdf_url publication_date
0 Corporate Average Fuel Economy Standards for M... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/03/2021
1 Public Hearing for Corporate Average Fuel Econ... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/14/2021
2 Investigation of Urea Ammonium Nitrate Solutio... Notice ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/08/2021
3 Anchorage Regulations; Mississippi River, Mile... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-08... 08/30/2021
4 Call for Nominations To Serve on the National ... Notice ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/08/2021
.. ... ... ... ... ...
112 Endangered and Threatened Wildlife and Plants;... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/07/2021
113 Energy Conservation Program: Test Procedures f... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/01/2021
114 Taking of Marine Mammals Incidental to Commerc... Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/17/2021
115 Partial Approval and Partial Disapproval of Ai... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/24/2021
116 Clean Air Plans; California; San Joaquin Valle... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/01/2021
[117 rows x 8 columns]
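If you then want just the three columns from your question, you can select and rename them from the downloaded CSV. A minimal sketch, assuming the export contains title, agency_names, and publication_date columns (check df.columns for the exact names on your end):
# Sketch: reduce the downloaded CSV to the Title / Author / Date layout.
# The column names below are assumptions -- confirm them with df.columns.
out = df[['title', 'agency_names', 'publication_date']].rename(
    columns={'title': 'Title', 'agency_names': 'Author', 'publication_date': 'Date'})
print(out.head())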
If I understand your question correctly, here is a working solution. The starting URL and the URL with page number 1 are the same. I scrape the page range (1, 5), meaning 4 pages; you can increase or decrease the page range whenever you like. To store the data in CSV format, uncomment the last line.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
for page in range(1, 5):
    url = 'https://www.federalregister.gov/documents/search?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021&conditions%5Bterm%5D=economy&page={page}'.format(page=page)
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # each search result is wrapped in a div with class "document-wrapper"
    tags = soup.find_all('div', class_='document-wrapper')
    for pro in tags:
        title = pro.select_one('h5 a').get_text(strip=True)
        author = pro.select_one('p a:nth-child(1)').get_text(strip=True)
        date = pro.select_one('p a:nth-child(2)').get_text(strip=True)
        data.append([title, author, date])

cols = ["Title", "Author", "Date"]
df = pd.DataFrame(data, columns=cols)
print(df)
#df.to_csv("data_info.csv", index=False)
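If you would rather not hard-code the page range, one possible variation (a sketch, not tested against the site) is to keep requesting pages until one comes back with no result cards. It reuses the headers and data variables from the snippet above and assumes a page past the last one contains no document-wrapper divs:
# Sketch: page until a results page is empty instead of using range(1, 5).
base = ('https://www.federalregister.gov/documents/search'
        '?conditions%5Bpublication_date%5D%5Bgte%5D=08%2F28%2F2021'
        '&conditions%5Bterm%5D=economy')
page = 1
while True:
    r = requests.get(base + '&page=' + str(page), headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    tags = soup.find_all('div', class_='document-wrapper')
    if not tags:  # assumption: a page past the last one has no result cards
        break
    for pro in tags:
        title = pro.select_one('h5 a').get_text(strip=True)
        author = pro.select_one('p a:nth-child(1)').get_text(strip=True)
        date = pro.select_one('p a:nth-child(2)').get_text(strip=True)
        data.append([title, author, date])
    page += 1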