如何从开始日期结束日期数据创建每日支出



我正在尝试对销售和费用数据集进行数据处理。数据具有开始日期和结束日期(非统一持续时间(。我需要将费用数据转换为开始日期和结束日期之间所有日期的每日数据。

我正在使用Python 3.6并尝试与Pandas合作。我尝试使用以下代码:

import pandas as pd
import datetime
df = pd.read_excel (r'C:WorkSamsungZZ.Codes for AutomatingData.xlsx')
df.mydates = pd.date_range(df['Start Date'], df['End Date']).tolist()
print(df)

我知道与我需要的输出相比,这种方法不正确,但我想看到输出存储在数据框的另一列中。但无论如何我都收到了错误。

我的输入如下所示:

   Category Brand Subbrand Start Date   End Date      Sales   Expense
0       C01   B01      S01 2016-06-16 2016-08-19   29445891  10943244
1       C01   B01      S02 2017-01-01 2017-01-27  190582635   9614901
2       C01   B01      S03 2016-02-03 2016-03-04   78779415   9450124
3       C01   B02      S01 2016-01-01 2016-12-31   55267620  11460273
4       C01   B02      S02 2016-12-25 2017-03-06  120107421   8406527
5       C01   B02      S03 2018-12-20 2018-12-31   90111536   7775826
6       C01   B03      S01 2017-01-01 2017-01-27   32969210   6915853
7       C01   B03      S02 2016-02-03 2016-03-04  128042469   8548281
8       C01   B03      S03 2016-01-01 2016-12-31  169384918  11810253
9       C02   B01      S01 2016-06-16 2016-08-19   47879682  10355269
10      C02   B01      S02 2018-12-20 2018-12-31   78764777   9814553
11      C02   B01      S03 2017-01-01 2017-01-27  109319359   6156288
12      C02   B02      S01 2016-01-01 2016-12-31  109478365   5111520
13      C02   B02      S02 2016-06-16 2016-08-19  112928078  10275697
14      C02   B02      S03 2018-12-20 2018-12-31   80626697   9078153
15      C02   B03      S01 2016-02-03 2016-03-04   36118823   6066609
16      C02   B03      S02 2016-12-25 2017-03-06  156779337   8715303
17      C02   B03      S03 2018-12-20 2018-12-31   55672206   5914943

我的预期输出:

Category Brand Subbrand Start Date End Date Sales Expense Duration Daily_Date Daily Expense
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-16 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-17 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-18 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-19 168357.6
…. …. …. …. …. …. …. …. …. ….
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-17 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-18 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-19 168357.6
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-01 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-02 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-03 356107.4444
…. …. …. …. …. …. …. …. …. ….
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-26 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-27 356107.4444
…. …. …. …. …. …. …. …. …. ….
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-20 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-21 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-22 756512.75
…. …. …. …. …. …. …. …. …. ….
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-30 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-31 756512.75

请告诉我解决这个问题的方向。

谢谢:)。

请按如下方式查看您的预期答案:

import pandas as pd
df = pd.read_excel(r'C:WorkSamsungZZ.Codes for AutomatingData.xlsx',sheet_name=0,parse_date=["Start Date","End Date"])
df['Daily'] = df['Expense']/(((df['End Date']-df['Start Date']).dt.days)+1)
zipped = zip(df['Category'], df['Brand'], 
             df['Subbrand'], df['Start Date'], df['End Date'], df['Sales'],df['Expense'], df['Daily'])
L = [(C, B, S, SD, ED, Snum, Enum, x, Dnum) for C, B, S, SD, ED, Snum, Enum, Dnum in zipped for x in pd.date_range(SD, ED)]
df = pd.DataFrame(L, columns=['Category', 'Brand','Subbrand','Start Date','End Date','Sales','Expense','Daily_date','Daily'])

我直接计算了每日值,但如果您想要日差,那么您也可以在压缩数据时将其分开。

输出将如下所示:

    Category    Brand   Subbrand    Start Date  End Date    Sales   Expense Daily_date  Daily
0   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-16  168357.600000
1   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-17  168357.600000
2   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-18  168357.600000
3   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-19  168357.600000
4   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-20  168357.600000
5   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-21  168357.600000
6   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-22  168357.600000
7   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-23  168357.600000
8   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-24  168357.600000
9   C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-25  168357.600000
10  C01 B01 S01 2016-06-16  2016-08-19  29445891    10943244    2016-06-26  168357.600000
... ... ... ... ... ... ... ... ... ...
1648    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-21  492911.916667
1649    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-22  492911.916667
1650    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-23  492911.916667
1651    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-24  492911.916667
1652    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-25  492911.916667
1653    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-26  492911.916667
1654    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-27  492911.916667
1655    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-28  492911.916667
1656    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-29  492911.916667
1657    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-30  492911.916667
1658    C02 B03 S03 2018-12-20  2018-12-31  55672206    5914943 2018-12-31  492911.916667

请从您的终端看一下,让我知道它是否适合您。

我只使用了相关列,排除了品牌等列。

duration = abs(pd.to_datetime(data.start_date) - pd.to_datetime(data.end_date))
data['duration'] = duration.apply(lambda d: d.days + 1)
data['daily_expense'] = data.apply(lambda x: x['expense']/x['duration'], axis=1)
expanded_data = pd.DataFrame(data=None, 
                             columns=['start_date', 'end_date', 'expense', 'duration', 'daily_date', 'daily_expense'])
def expand_rows(sd, ed, exp, duration, daily_exp):
    global expanded_data
    single_row_expansion = []
    for d in pd.date_range(start=sd, end=ed):
        new_entry = {'start_date': sd, 'end_date': ed, 'expense': exp,
                     'duration': duration, 
                     'daily_date': pd.datetime.strftime(d.date(), '%Y-%m-%d'),
                    'daily_expense': daily_exp}
        single_row_expansion.append(new_entry)
    single_row_exp_df = pd.DataFrame(single_row_expansion)
    expanded_data = pd.concat([expanded_data, single_row_exp_df], sort=True)
data.apply(lambda r: expand_rows(*r), axis=1)

这将创建一个包含每日记录的新数据帧。

不确定要做什么,但这不是最佳方法,因为新数据帧中有很多冗余。您可以改为创建一个持续时间列,并在需要时计算每日费用。无需展开行。

另一种使用melt的解决方案

# duration
df['Duration']  = (df['End Date'] - df['Start Date']).dt.days + 1
# daily expense
df['Daily Expense'] = df['Expense']/df['Duration']
# date range to unfold
df['range'] = df.apply(lambda r: pd.date_range(r['Start Date'], r['End Date'], freq='D'),
                       axis=1)
# define the columns to melt
columns = df.columns[:-1]
# convert range to series
# merge with current df
# remove the old range
# melt and dropna
df['range'].apply(pd.Series) 
    .merge(df, right_index = True, left_index = True) 
    .drop(['range'], axis = 1)
    .melt(id_vars = columns, value_name = "Daily_Date").dropna()

最新更新