请参阅下面的代码。
概述
我有一个数据帧"df_master",它包含四列:Date;应该从中提取数据的特定df的名称;在指定日期进行测量;并在指定日期后一年进行测量。
我要做的是编写代码,这样,对于df_master中的每一行,代码都会从大量其他dfs中的一个将测量数据拉入df_master。每一行将具有不同的df,从中提取测量数据。我想在df_master中填充两个新列。一个是从df_master中该行所列日期的指定df中提取的测量数据,另一个是未来一年日期的测量数据。
示例
例如,考虑下面的代码。df_master中的第一行是"2016-01-01"one_answers"df_B"。这意味着df_master的第一行应该填充"Measurement_Today"的值250和"Measurements_One_Year_in_Future"的值265。
df_master的下一行将需要从df_C中提取Measurement数据,等等。换句话说,df_master中的每一行都将指定度量数据将来自的df,并且该df因行而异。
假设有数千个dfs,df_master必须从中提取数据,并且不可能将所有这些数千个df组合成一个df。
我曾尝试编写一个for循环,但没有成功。我还试着使用.iloc来提取未来一年的数据,但这也不起作用。
如能提供任何帮助,我们将不胜感激。非常感谢。
代码
# Import dependencies
import pandas as pd
import numpy as np
# Create 'df_A', 'df_B', 'df_C' and 'df_D', which contain measurement data on specific dates.
df_A = pd.DataFrame(np.array([['2016-01-01', 150], ['2017-01-01', 145],
['2018-01-01', 163], ['2019-01-01', 170],
['2020-01-01', 198], ['2021-01-01', 189],]),
columns=['Date', 'Measurement'])
df_A['Date'] = pd.to_datetime(df_A['Date'])
df_B = pd.DataFrame(np.array([['2016-01-01', 250], ['2017-01-01', 265],
['2018-01-01', 221], ['2019-01-01', 285],
['2020-01-01', 298], ['2021-01-01', 289],]),
columns=['Date', 'Measurement'])
df_B['Date'] = pd.to_datetime(df_B['Date'])
df_C = pd.DataFrame(np.array([['2016-01-01', 350], ['2017-01-01', 367],
['2018-01-01', 392], ['2019-01-01', 370],
['2020-01-01', 398], ['2021-01-01', 389],]),
columns=['Date', 'Measurement'])
df_C['Date'] = pd.to_datetime(df_C['Date'])
df_D = pd.DataFrame(np.array([['2016-01-01', 450], ['2017-01-01', 454],
['2018-01-01', 413], ['2019-01-01', 480],
['2020-01-01', 498], ['2021-01-01', 489],]),
columns=['Date', 'Measurement'])
df_D['Date'] = pd.to_datetime(df_D['Date'])
# Create df_master
df_master = pd.DataFrame(np.array([['2016-01-01', 'df_B','','' ], ['2017-01-01', 'df_C','','' ],
['2018-01-01', 'df_B','','' ], ['2019-01-01', 'df_A','','' ],
['2018-01-01', 'df_A','','' ], ['2019-01-01', 'df_D','','' ],]),
columns=['Date', 'df_to_pull_measurement_from', 'Measurement_Today',
'Measurement_Next_Year'])
df_master['Date'] = pd.to_datetime(df_master['Date'])
# Create list of dfs from df_master['df_to_pull_measurement_from'].
list_of_dfs = df_master['df_to_pull_measurement_from']
### THIS DOES NOT WORK ###
# Add columns to df_master for measurement on given date and measurement one year into future.
for row in list_of_dfs:
selected_df = row + '['Measurement']'
df_master['Measurement_Today'] = selected_df
df_master['Measurement_Next_Year'] = selected_df.iloc[idx +1]
我认为对字符串变量使用全局变量不是一个好主意,最好创建字典:
#create DatetimeIndex
df_master = df_master.set_index('Date')
for row in list_of_dfs:
selected_df = globals()[row].set_index('Date')['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty
#create DatetimeIndex
df_master = df_master.set_index('Date')
#dict of DataFrames
dfs = {'df_A':df_A, 'df_B':df_B,'df_C':df_C, 'df_D':df_D}
#create DatetimeIndex in each DataFrame
dfs1 = {k: v.set_index('Date') for k, v in dfs.items()}
for row in list_of_dfs:
selected_df = dfs1[row]['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
#for next years is subtract one year from DatetimeIndex
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty
print (df_master)
df_to_pull_measurement_from Measurement_Today Measurement_Next_Year
Date
2016-01-01 df_B 250 265
2017-01-01 df_C 367 392
2018-01-01 df_B 221 285
2019-01-01 df_A 170 198
2018-01-01 df_A 163 170
2019-01-01 df_D 480 498