Dynamically merging two dataframes



Background

I have two dataframes. One contains some unrelated columns along with forecast sales volumes per product (x1, x2, …):

date        ColA        Colb    x1  x2  x3  …
1/01/2022   Apple       Cat     703 750 273 738
2/01/2022   Banana      Dog     540 908 397 726
3/01/2022   Orange      Pig     444 835 258 986
4/01/2022   Blueberry   Rat     969 582 393 911
5/01/2022   Watermelon  Bird    426 950 321 337
…       

The other contains each product's price and the date it goes on sale in stores.

product start_date  price
x1      2/01/2022   0.9
x2      4/01/2022   0.5
x3      4/01/2022   1.2
…       5/01/2022   1.0

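What I want to do

I am trying to:

  1. Set the forecast sales volume to zero for any date before the product's start date
  2. Multiply the forecast sales volume by the product's price to get forecast revenue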
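
The two rules above can be sketched for a single cell as follows. This is only an illustration: the helper name forecast_revenue and its parameters are hypothetical, and it assumes that a date equal to the start date counts as on sale, which is what the desired result further down suggests.

```python
import pandas as pd


def forecast_revenue(units, sale_date, start_date, price):
    """Return forecast revenue for one cell (hypothetical helper).

    Rule 1: before the product's start date the forecast is zeroed.
    Rule 2: otherwise the forecast units are multiplied by the price.
    """
    if pd.Timestamp(sale_date) < pd.Timestamp(start_date):
        return 0.0
    return units * price


# x1 starts on 2022-01-02 at a price of 0.9
before = forecast_revenue(703, "2022-01-01", "2022-01-02", 0.9)
on_start = forecast_revenue(540, "2022-01-02", "2022-01-02", 0.9)
print(before, round(on_start, 1))
```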

Desired result

date        ColA        Colb    x1      x2  x3      …
1/01/2022   Apple       Cat     0       0   0       0
2/01/2022   Banana      Dog     486     0   0       0
3/01/2022   Orange      Pig     399.6   0   0       0
4/01/2022   Blueberry   Rat     872.1   291 471.6   0
5/01/2022   Watermelon  Bird    383.4   475 385.2   337
…       

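Problem

The problem is that the number of products varies. At the moment I edit my code by hand every time I run this calculation. Multiplying by the price is easy, but the Start_Date check is hard, and I don't know of a simpler way to do it dynamically. Below is an example of my current code.

Note: in most cases the two dataframes list the products in the same order (i.e. x1, x2, x3, and so on).

Code example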

import pandas as pd
# Dataframe one example
df1 = pd.DataFrame(data={
'Date': pd.Series(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05'], dtype='datetime64[ns]'),
'ColA': ['Apple', 'Banana', 'Orange', 'Blueberry', 'Watermelon'],
'ColB': ['Cat', 'Dog', 'Pig', 'Rat', 'Bird'],
'x1': [703, 540, 444, 969, 426],
'x2': [750, 908, 835, 582, 950],
'x3': [273, 397, 258, 393, 321],
'x4': [738, 726, 986, 911, 337]
})
# Dataframe two example
df2 = pd.DataFrame(data={
'Product': ['x1', 'x2', 'x3', 'x4'],
'Start_Date': pd.Series(['2022-01-02', '2022-01-04', '2022-01-04', '2022-01-05'], dtype='datetime64[ns]'),
'Price': [0.9, 0.5, 1.2, 1.0]
})
# set_index returns a new frame, so the result has to be assigned back
df2 = df2.set_index('Product')
# Multiply by Price
df3 = df1.iloc[:, 3:7].multiply(df2[['Price']].values.T, axis='columns')
df3['sd1'] = df2['Start_Date'].iloc[0]
df3['sd2'] = df2['Start_Date'].iloc[1]
df3['sd3'] = df2['Start_Date'].iloc[2]
df3['sd4'] = df2['Start_Date'].iloc[3]
df3 = df1[['Date', 'ColA', 'ColB']].merge(df3, left_index=True, right_index=True)
print(df3)

The following snippet should do what you need:

from datetime import datetime
import pandas as pd
# Dataframe one example
df1 = pd.DataFrame(data={
'Date': pd.Series(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05'], dtype='datetime64[ns]'),
'ColA': ['Apple', 'Banana', 'Orange', 'Blueberry', 'Watermelon'],
'ColB': ['Cat', 'Dog', 'Pig', 'Rat', 'Bird'],
'x1': [703, 540, 444, 969, 426],
'x2': [750, 908, 835, 582, 950],
'x3': [273, 397, 258, 393, 321],
'x4': [738, 726, 986, 911, 337]
})
# Dataframe two example
df2 = pd.DataFrame(data={
'Product': ['x1', 'x2', 'x3', 'x4'],
'Start_Date': pd.Series(['2022-01-02', '2022-01-04', '2022-01-04', '2022-01-05'], dtype='datetime64[ns]'),
'Price': [0.9, 0.5, 1.2, 1.0]
})
df2.set_index('Product', inplace=True)
# Multiply by Price and generate the final df
cols_to_work = [col for col in df1.columns if col.startswith('x')]
datetime_col = 'Date'
datetime_col_df2 = 'Start_Date'
# Cast the product columns to float so the revenue values fit their dtype
df1 = df1.astype({col: float for col in cols_to_work})
for idx, rows in df1.iterrows():
    for col in cols_to_work:
        if rows[datetime_col] >= df2[datetime_col_df2].loc[col]:
            # On or after the start date: convert units to revenue
            df1.loc[idx, col] = df1.loc[idx, col] * df2['Price'].loc[col]
        else:
            # Before the start date: zero out the forecast
            df1.loc[idx, col] = 0
df1.head()

Output:

        Date        ColA  ColB     x1     x2     x3     x4
0 2022-01-01       Apple   Cat    0.0    0.0    0.0    0.0
1 2022-01-02      Banana   Dog  486.0    0.0    0.0    0.0
2 2022-01-03      Orange   Pig  399.6    0.0    0.0    0.0
3 2022-01-04   Blueberry   Rat  872.1  291.0  471.6    0.0
4 2022-01-05  Watermelon  Bird  383.4  475.0  385.2  337.0
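
As an alternative to the row-by-row loop, the same result can probably be obtained without explicit iteration. The following is a minimal sketch, assuming every product listed in df2 has a column of the same name in df1. The names lookup, products, on_sale and result are introduced here for illustration only. It swaps the loop for NumPy broadcasting and np.where.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                            "2022-01-04", "2022-01-05"]),
    "ColA": ["Apple", "Banana", "Orange", "Blueberry", "Watermelon"],
    "ColB": ["Cat", "Dog", "Pig", "Rat", "Bird"],
    "x1": [703, 540, 444, 969, 426],
    "x2": [750, 908, 835, 582, 950],
    "x3": [273, 397, 258, 393, 321],
    "x4": [738, 726, 986, 911, 337],
})
df2 = pd.DataFrame({
    "Product": ["x1", "x2", "x3", "x4"],
    "Start_Date": pd.to_datetime(["2022-01-02", "2022-01-04",
                                  "2022-01-04", "2022-01-05"]),
    "Price": [0.9, 0.5, 1.2, 1.0],
})

# Assumption: product names in df2 match column names in df1.
lookup = df2.set_index("Product")
products = [col for col in df1.columns if col in lookup.index]

# Look up prices and start dates by product name, not by position.
prices = lookup.loc[products, "Price"].to_numpy()
starts = lookup.loc[products, "Start_Date"].to_numpy()

# (n_rows, 1) compared with (n_products,) broadcasts to (n_rows, n_products).
on_sale = df1["Date"].to_numpy()[:, None] >= starts

# Keep units * price where the product is on sale, otherwise 0.
result = df1.copy()
result[products] = np.where(on_sale, df1[products].to_numpy() * prices, 0.0)
print(result)
```

Because prices and start dates are looked up by product name, this sketch should not depend on the two dataframes listing products in the same order, and it picks up however many product columns are present.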
