How to iterate a web-scraping script over a daily time series object, so as to build a daily time series from a web page



Thanks for looking at my question. I've written a script using BeautifulSoup and Pandas that scrapes projection data from the Federal Reserve's website. Projections come out once a quarter (roughly every 3 months). I'd like to write a script that builds a daily time series and checks the Federal Reserve's website once a day: if a new projection has been released, the script adds it to the time series; if there is no update, the script simply appends the last valid projection to the series.

From my initial digging, it seems there are external tools that could be used to "trigger" the script daily, but I'd prefer to keep everything purely in Python.
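One pure-Python option is a long-running loop that sleeps until a fixed time each day and then calls the scraper. This is only a sketch: `run_daily` and `seconds_until_next` are names invented here, the 9:00 run time is arbitrary, and `job` stands in for whatever function wraps the scraping code below.

```python
import time
from datetime import datetime, timedelta

def seconds_until_next(hour, now=None):
    """Seconds from `now` until the next occurrence of `hour`:00."""
    now = now or datetime.now()
    nxt = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if nxt <= now:
        # that time has already passed today, so target tomorrow
        nxt += timedelta(days=1)
    return (nxt - now).total_seconds()

def run_daily(job, hour=9):
    """Call `job()` once a day at `hour`:00, sleeping in between."""
    while True:
        time.sleep(seconds_until_next(hour))
        job()
```

The trade-off versus an external scheduler (cron, Task Scheduler) is that the Python process has to stay alive the whole time; if the machine reboots, nothing restarts the loop.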

The code I've written to do the scraping is below:

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the links for the projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Scrape each projection table into a list of DataFrames
decfcasts = []
for i in projections:
    url = "https://www.federalreserve.gov{}".format(i)
    file = wget.download(url)
    df_list = pd.read_html(file)
    fcast = df_list[-1].iloc[:, 0:2]
    fcast.columns = ['Target', 'Votes']
    fcast.fillna(0, inplace=True)
    decfcasts.append(fcast)

So far, the code I've written puts everything into a list, but the data has no time/date index. I've been trying to write pseudocode for the rest, and my guess is it would look something like:

create daily time series object
for each day in time series:
    if day in time series == day of a projection link:
        run webscraper
    otherwise, append time series with last available observation
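The pseudocode above maps fairly directly onto pandas: place each projection at its release date, reindex onto a daily calendar, and forward-fill the gaps. A minimal sketch, where the dates and values are made up for illustration rather than taken from real Fed data:

```python
import pandas as pd

# Illustrative quarterly releases (dates and values are invented)
releases = pd.Series(
    [2.50, 2.75, 3.00],
    index=pd.to_datetime(['2023-03-22', '2023-06-14', '2023-09-20']),
)

# Reindex onto a daily calendar and carry the last release forward
daily = pd.date_range('2023-03-22', '2023-09-25', freq='D')
series = releases.reindex(daily).ffill()
```

After the `ffill()`, every day between releases repeats the most recent projection, which is exactly the "repeat until a new projection appears" behaviour described above.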

At least, that's my thinking. The final time series will probably end up looking fairly "clumpy", in the sense that there will be many days with the same observation, then a "jump" when a new projection comes out, followed by more repeats until the next projection is released.

Any help is greatly appreciated, obviously. Thanks in advance either way!

I've edited your code for you. It now gets the date from the URL, and the date is saved in the DataFrame as a Period. A date is only processed and appended when it is not already present in the DataFrame (restored from the pickle).

from bs4 import BeautifulSoup
import requests
import re
import wget
import pandas as pd

# Starting url and the indicator (key) for links of interest
url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
key = '/monetarypolicy/fomcprojtabl'

# Cook the soup
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Collect the links for the projection pages
projections = []
for link in soup.find_all('a', href=re.compile(key)):
    projections.append(link["href"])

# Past results from pickle; when there is no pickle, init an empty dataframe
try:
    decfcasts = pd.read_pickle('decfcasts.pkl')
except FileNotFoundError:
    decfcasts = pd.DataFrame(columns=['target', 'votes', 'date'])

for i in projections:
    # Parse the date from the url (note the escaped \d+ — an unescaped 'd+'
    # would match the letter d, not digits)
    date = pd.Period(''.join(re.findall(r'\d+', i)), 'D')
    # Process the projection only if it isn't already in the pickled data
    if date not in decfcasts['date'].values:
        url = "https://www.federalreserve.gov{}".format(i)
        file = wget.download(url)
        df_list = pd.read_html(file)
        fcast = df_list[-1].iloc[:, 0:2]
        fcast.columns = ['target', 'votes']
        fcast.fillna(0, inplace=True)
        # Tag every row of this table with its release date
        fcast.insert(2, 'date', date)
        # DataFrame.append was removed in pandas 2.0; use concat instead
        decfcasts = pd.concat([decfcasts, fcast], ignore_index=True)

# Save to pickle
decfcasts.to_pickle('decfcasts.pkl')
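The date handling above hinges on the escaped `\d+` pattern: with an unescaped `'d+'`, `re.findall` matches runs of the literal letter `d` rather than digits, so the Period constructor fails. A quick check of the intended behaviour — the URL path follows the real FOMC naming pattern, but the specific date here is just an example:

```python
import re
import pandas as pd

url_path = '/monetarypolicy/fomcprojtabl20230614.htm'

# Extract the digit run from the path and parse it as a daily Period
date = pd.Period(''.join(re.findall(r'\d+', url_path)), 'D')
```

Because the parsed value is a `pd.Period` with daily frequency, the `date not in decfcasts['date'].values` membership test compares release dates exactly, regardless of how the URL formats them.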
