How can I scrape historical data when the filter only accepts a date range?



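I'm trying to scrape some historical data from the following URL: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR

I want to scrape all of the historical data, but the site only lets me scrape the daily prices for the last 30 days. To go back further I have to use the filter, which can only cover one year at a time.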

I can easily get the information available in the first table for a couple of funds with the following code:

import pandas as pd
import datetime
import csv
import requests

urls = [
    'https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR',
    'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR',
    'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR',
    'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR',
]

# Change the date format, as the FT website appears to render two versions
# of the date for different browser sizes
def format_date(date):
    parts = date.split(',')
    return parts[-2][1:] + parts[-1]

# Create a list so that all scraped data can be saved in one .csv file
dfs = []
# Scraping loop over all defined urls
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    df['Date'] = df['Date'].apply(format_date)
    print(df)
    dfs.append(df)

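However, I can't work out how to use the filter on the web page to get more historical data. I've tried many things but keep getting different error messages. How can I do this?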

The data is loaded from an API that lets you set a date range, for example https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333. This makes it possible to sidestep the filter problem:

import requests
import pandas as pd
from datetime import datetime
from io import StringIO
import time

# Annual boundary dates for the past 100 years, newest first
datelist = pd.date_range(end=datetime.now(), periods=100, freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
frames = []
# Not sure when the historical data starts, so walk back one year at a time
# and stop at the first window that fails or returns no table
for end, start in zip(datelist, datelist[1:]):
    try:
        r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol=535700333').json()
        df_temp = pd.read_html(StringIO('<table>' + r['html'] + '</table>'))[0]
        df_temp.columns = columns
    except (requests.RequestException, ValueError, KeyError):
        break
    frames.append(df_temp)
    time.sleep(2)  # pause between requests
# DataFrame.append was removed in pandas 2.0, so combine the pieces with concat
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=columns)

Output:

78.8 78.8 78.8
78.89 78.89 78.89 78.89
Wednesday, October 13, 2021 78.7 78.7 78.7 78.7
78.58 78.58 78.58 78.58
78.58 78.58 78.58 78.58
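
The loop above combines two separable steps: cutting a long period into windows of at most one year, and turning the `html` field of the JSON response into a DataFrame. A minimal sketch of both follows, in case a known start date is available. The helper names `year_windows` and `rows_to_frame` are hypothetical, and the assumption that `html` holds bare table rows is inferred from the code above rather than from any FT documentation.

```python
from datetime import date
from io import StringIO

import pandas as pd

COLUMNS = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']


def year_windows(start, end):
    """Split [start, end] into (start, end) string pairs of at most one year, newest first."""
    windows = []
    first = pd.Timestamp(start)
    window_end = pd.Timestamp(end)
    while window_end > first:
        # Step back one year, but never past the requested start date
        window_start = max(first, window_end - pd.DateOffset(years=1))
        windows.append((window_start.strftime('%Y/%m/%d'),
                        window_end.strftime('%Y/%m/%d')))
        window_end = window_start
    return windows


def rows_to_frame(html_rows):
    """Parse a fragment of <tr> rows (assumed shape of the 'html' field) into a DataFrame."""
    frame = pd.read_html(StringIO('<table>' + html_rows + '</table>'))[0]
    frame.columns = COLUMNS
    return frame
```

For example, `year_windows(date(2019, 10, 1), date(2021, 10, 1))` yields two windows that can be substituted for `start` and `end` in the request URL. Adjacent windows share a boundary date, so calling `drop_duplicates()` on the concatenated result is probably worthwhile.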
