如何从其他几个dfs中的一个将数据拉入pandas df_master,其中从中提取数据的df_master在df_ma



请参阅下面的代码。

概述

我有一个数据帧"df_master",它包含四列:Date;应该从中提取数据的特定df的名称;在指定日期进行测量;并在指定日期后一年进行测量。

我要做的是编写代码,这样,对于df_master中的每一行,代码都会从大量其他dfs中的一个将测量数据拉入df_master。每一行将具有不同的df,从中提取测量数据。我想在df_master中填充两个新列。一个是从df_master中该行所列日期的指定df中提取的测量数据,另一个是未来一年日期的测量数据。

示例

例如,考虑下面的代码。df_master中的第一行是"2016-01-01"one_answers"df_B"。这意味着df_master的第一行应该填充"Measurement_Today"的值250和"Measurements_One_Year_in_Future"的值265。

df_master的下一行将需要从df_C中提取Measurement数据,等等。换句话说,df_master中的每一行都将指定度量数据将来自的df,并且该df因行而异。

假设有数千个dfs,df_master必须从中提取数据,并且不可能将所有这些数千个df组合成一个df。

我曾尝试编写一个for循环,但没有成功。我还试着使用.iloc来提取未来一年的数据,但这也不起作用。

如能提供任何帮助,我们将不胜感激。非常感谢。

代码

# Import dependencies
import pandas as pd
import numpy as np
# Create 'df_A', 'df_B', 'df_C' and 'df_D', which contain measurement data on specific dates.
df_A = pd.DataFrame(np.array([['2016-01-01', 150], ['2017-01-01', 145], 
['2018-01-01', 163], ['2019-01-01', 170],
['2020-01-01', 198], ['2021-01-01', 189],]),
columns=['Date', 'Measurement'])
df_A['Date'] = pd.to_datetime(df_A['Date'])
df_B = pd.DataFrame(np.array([['2016-01-01', 250], ['2017-01-01', 265], 
['2018-01-01', 221], ['2019-01-01', 285],
['2020-01-01', 298], ['2021-01-01', 289],]),
columns=['Date', 'Measurement'])
df_B['Date'] = pd.to_datetime(df_B['Date'])
df_C = pd.DataFrame(np.array([['2016-01-01', 350], ['2017-01-01', 367], 
['2018-01-01', 392], ['2019-01-01', 370],
['2020-01-01', 398], ['2021-01-01', 389],]),
columns=['Date', 'Measurement'])
df_C['Date'] = pd.to_datetime(df_C['Date'])
df_D = pd.DataFrame(np.array([['2016-01-01', 450], ['2017-01-01', 454], 
['2018-01-01', 413], ['2019-01-01', 480],
['2020-01-01', 498], ['2021-01-01', 489],]),
columns=['Date', 'Measurement'])
df_D['Date'] = pd.to_datetime(df_D['Date'])

# Create df_master 
df_master = pd.DataFrame(np.array([['2016-01-01', 'df_B','','' ], ['2017-01-01', 'df_C','',''  ], 
['2018-01-01', 'df_B','','' ], ['2019-01-01', 'df_A','','' ],
['2018-01-01', 'df_A','','' ], ['2019-01-01', 'df_D','','' ],]),
columns=['Date', 'df_to_pull_measurement_from', 'Measurement_Today', 
'Measurement_Next_Year'])
df_master['Date'] = pd.to_datetime(df_master['Date'])

# Create list of dfs from df_master['df_to_pull_measurement_from'].
list_of_dfs = df_master['df_to_pull_measurement_from']
### THIS DOES NOT WORK ###
# Add columns to df_master for measurement on given date and measurement one year into future.
for row in list_of_dfs:
selected_df = row + '['Measurement']'
df_master['Measurement_Today'] = selected_df
df_master['Measurement_Next_Year'] = selected_df.iloc[idx +1]

我认为对字符串变量使用全局变量不是一个好主意,最好创建字典:

#create DatetimeIndex 
df_master = df_master.set_index('Date')
for row in list_of_dfs:
selected_df = globals()[row].set_index('Date')['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty


#create DatetimeIndex 
df_master = df_master.set_index('Date')
#dict of DataFrames
dfs = {'df_A':df_A, 'df_B':df_B,'df_C':df_C, 'df_D':df_D}
#create DatetimeIndex in each DataFrame
dfs1 = {k: v.set_index('Date') for k, v in dfs.items()}
for row in list_of_dfs:
selected_df = dfs1[row]['Measurement']
m1 = df_master['df_to_pull_measurement_from'].eq(row)
#for next years is subtract one year from DatetimeIndex
nexty = selected_df.rename(lambda x: x - pd.offsets.DateOffset(years= 1))
df_master.loc[m1, 'Measurement_Today'] = selected_df
df_master.loc[m1, 'Measurement_Next_Year'] = nexty

print (df_master)
df_to_pull_measurement_from Measurement_Today Measurement_Next_Year
Date                                                                          
2016-01-01                        df_B               250                   265
2017-01-01                        df_C               367                   392
2018-01-01                        df_B               221                   285
2019-01-01                        df_A               170                   198
2018-01-01                        df_A               163                   170
2019-01-01                        df_D               480                   498

最新更新