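I am trying to process a dataset of sales and expenses. Each record has a start date and an end date (the durations are not uniform). I need to convert the expense data into daily data for every date between the start date and the end date.
I am using Python 3.6 and trying to work with Pandas. I tried the following code: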
import pandas as pd
import datetime
df = pd.read_excel (r'C:WorkSamsungZZ.Codes for AutomatingData.xlsx')
df.mydates = pd.date_range(df['Start Date'], df['End Date']).tolist()
print(df)
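I know this approach does not produce the output I need, but I wanted to see the result stored in another column of the DataFrame. In any case, I get an error.
My input looks like this: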
Category Brand Subbrand Start Date End Date Sales Expense
0 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244
1 C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901
2 C01 B01 S03 2016-02-03 2016-03-04 78779415 9450124
3 C01 B02 S01 2016-01-01 2016-12-31 55267620 11460273
4 C01 B02 S02 2016-12-25 2017-03-06 120107421 8406527
5 C01 B02 S03 2018-12-20 2018-12-31 90111536 7775826
6 C01 B03 S01 2017-01-01 2017-01-27 32969210 6915853
7 C01 B03 S02 2016-02-03 2016-03-04 128042469 8548281
8 C01 B03 S03 2016-01-01 2016-12-31 169384918 11810253
9 C02 B01 S01 2016-06-16 2016-08-19 47879682 10355269
10 C02 B01 S02 2018-12-20 2018-12-31 78764777 9814553
11 C02 B01 S03 2017-01-01 2017-01-27 109319359 6156288
12 C02 B02 S01 2016-01-01 2016-12-31 109478365 5111520
13 C02 B02 S02 2016-06-16 2016-08-19 112928078 10275697
14 C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153
15 C02 B03 S01 2016-02-03 2016-03-04 36118823 6066609
16 C02 B03 S02 2016-12-25 2017-03-06 156779337 8715303
17 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943
My expected output:
Category Brand Subbrand Start Date End Date Sales Expense Duration Daily_Date Daily Expense
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-16 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-17 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-18 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-06-19 168357.6
…. …. …. …. …. …. …. …. …. ….
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-17 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-18 168357.6
C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 65 2016-08-19 168357.6
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-01 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-02 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-03 356107.4444
…. …. …. …. …. …. …. …. …. ….
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-26 356107.4444
C01 B01 S02 2017-01-01 2017-01-27 190582635 9614901 27 2017-01-27 356107.4444
…. …. …. …. …. …. …. …. …. ….
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-20 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-21 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-22 756512.75
…. …. …. …. …. …. …. …. …. ….
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-30 756512.75
C02 B02 S03 2018-12-20 2018-12-31 80626697 9078153 12 2018-12-31 756512.75
Please point me in the right direction to solve this.
Thanks :).
Here is one way to get your expected output:
import pandas as pd
# parse_dates (not parse_date) loads both columns as datetime64
df = pd.read_excel(r'C:WorkSamsungZZ.Codes for AutomatingData.xlsx', sheet_name=0, parse_dates=["Start Date", "End Date"])
# daily expense = total expense / inclusive number of days in the period
df['Daily'] = df['Expense'] / ((df['End Date'] - df['Start Date']).dt.days + 1)
zipped = zip(df['Category'], df['Brand'], df['Subbrand'], df['Start Date'],
             df['End Date'], df['Sales'], df['Expense'], df['Daily'])
# one tuple per calendar day between Start Date and End Date (inclusive)
L = [(C, B, S, SD, ED, Snum, Enum, x, Dnum)
     for C, B, S, SD, ED, Snum, Enum, Dnum in zipped
     for x in pd.date_range(SD, ED)]
df = pd.DataFrame(L, columns=['Category', 'Brand', 'Subbrand', 'Start Date', 'End Date',
                              'Sales', 'Expense', 'Daily_date', 'Daily'])
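I computed the daily value directly, but if you also want the difference in days, you can compute it as a separate column and include it when zipping the data.
The output looks like this: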
Category Brand Subbrand Start Date End Date Sales Expense Daily_date Daily
0 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-16 168357.600000
1 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-17 168357.600000
2 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-18 168357.600000
3 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-19 168357.600000
4 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-20 168357.600000
5 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-21 168357.600000
6 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-22 168357.600000
7 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-23 168357.600000
8 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-24 168357.600000
9 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-25 168357.600000
10 C01 B01 S01 2016-06-16 2016-08-19 29445891 10943244 2016-06-26 168357.600000
... ... ... ... ... ... ... ... ... ...
1648 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-21 492911.916667
1649 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-22 492911.916667
1650 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-23 492911.916667
1651 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-24 492911.916667
1652 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-25 492911.916667
1653 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-26 492911.916667
1654 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-27 492911.916667
1655 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-28 492911.916667
1656 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-29 492911.916667
1657 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-30 492911.916667
1658 C02 B03 S03 2018-12-20 2018-12-31 55672206 5914943 2018-12-31 492911.916667
Please try it on your end and let me know whether it works for you.
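The remark above about keeping the day difference as its own column could be sketched as follows. This is only a minimal sketch: it uses two rows copied from the question's input instead of the Excel file, takes the `Duration`, `Daily_Date` and `Daily Expense` column names from the asker's expected output, and the names `rows` and `out` are illustrative assumptions.

```python
import pandas as pd

# Two rows copied from the question's input, standing in for the Excel file.
df = pd.DataFrame({
    'Category': ['C01', 'C01'],
    'Brand': ['B01', 'B01'],
    'Subbrand': ['S01', 'S02'],
    'Start Date': pd.to_datetime(['2016-06-16', '2017-01-01']),
    'End Date': pd.to_datetime(['2016-08-19', '2017-01-27']),
    'Sales': [29445891, 190582635],
    'Expense': [10943244, 9614901],
})

# Inclusive number of days in each period, kept as its own column.
df['Duration'] = (df['End Date'] - df['Start Date']).dt.days + 1
df['Daily Expense'] = df['Expense'] / df['Duration']

# Repeat every source row once per calendar day in its period.
# rec[3] and rec[4] are Start Date and End Date; rec[-1] is Daily Expense.
rows = [
    (*rec[:-1], day, rec[-1])
    for rec in df.itertuples(index=False, name=None)
    for day in pd.date_range(rec[3], rec[4])
]
out = pd.DataFrame(
    rows,
    columns=['Category', 'Brand', 'Subbrand', 'Start Date', 'End Date',
             'Sales', 'Expense', 'Duration', 'Daily_Date', 'Daily Expense'],
)
print(out.head())
```

Each period contributes exactly `Duration` rows, so summing `Daily Expense` per period should give back the original `Expense` up to floating-point rounding.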
I used only the relevant columns and left out columns such as the brand.
import pandas as pd
# `data` is assumed to hold only start_date, end_date and expense, in that order
duration = abs(pd.to_datetime(data.start_date) - pd.to_datetime(data.end_date))
data['duration'] = duration.dt.days + 1
data['daily_expense'] = data['expense'] / data['duration']
expanded_data = pd.DataFrame(data=None,
                             columns=['start_date', 'end_date', 'expense', 'duration', 'daily_date', 'daily_expense'])
def expand_rows(sd, ed, exp, duration, daily_exp):
    global expanded_data
    single_row_expansion = []
    # one dictionary per calendar day in the period
    for d in pd.date_range(start=sd, end=ed):
        new_entry = {'start_date': sd, 'end_date': ed, 'expense': exp,
                     'duration': duration,
                     # pd.datetime was removed from pandas; format the Timestamp directly
                     'daily_date': d.strftime('%Y-%m-%d'),
                     'daily_expense': daily_exp}
        single_row_expansion.append(new_entry)
    single_row_exp_df = pd.DataFrame(single_row_expansion)
    expanded_data = pd.concat([expanded_data, single_row_exp_df], sort=True)
# *r unpacks each row positionally, so the column order must match expand_rows
data.apply(lambda r: expand_rows(*r), axis=1)
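This creates a new DataFrame containing the daily records.
I am not sure what you intend to do with it, but this is not the best approach, because the new DataFrame contains a lot of redundancy. You could instead create a duration column and compute the daily expense when you need it, without expanding the rows.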
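The suggestion above (keep one row per period and derive the daily figure only when needed) might look like the following. This is a minimal sketch, assuming the same lowercase column names as the snippet above and two sample periods taken from the question; `expense_on` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Two sample periods from the question, one row per period (no expansion).
data = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-06-16', '2017-01-01']),
    'end_date': pd.to_datetime(['2016-08-19', '2017-01-27']),
    'expense': [10943244, 9614901],
})

# Inclusive duration in days and the expense attributed to each day.
data['duration'] = (data['end_date'] - data['start_date']).dt.days + 1
data['daily_expense'] = data['expense'] / data['duration']

def expense_on(frame, day):
    """Hypothetical helper: summed daily expense of all periods covering `day`."""
    day = pd.Timestamp(day)
    active = (frame['start_date'] <= day) & (day <= frame['end_date'])
    return frame.loc[active, 'daily_expense'].sum()

print(expense_on(data, '2016-07-01'))
```

The frame stays at one row per period, and a lookup for a date that no period covers simply sums an empty selection and returns zero.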
Another solution, using melt:
# duration
df['Duration'] = (df['End Date'] - df['Start Date']).dt.days + 1
# daily expense
df['Daily Expense'] = df['Expense']/df['Duration']
# date range to unfold
df['range'] = df.apply(lambda r: pd.date_range(r['Start Date'], r['End Date'], freq='D'),
                       axis=1)
# define the columns to melt
columns = list(df.columns[:-1])
# convert range to series
# merge with current df
# remove the old range
# melt and dropna
# (the chain is wrapped in parentheses so it parses, and the result is kept)
result = (df['range'].apply(pd.Series)
          .merge(df, right_index=True, left_index=True)
          .drop(['range'], axis=1)
          .melt(id_vars=columns, value_name="Daily_Date").dropna())
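On pandas 0.25 or later, the same unfolding can probably be written more directly with `DataFrame.explode`, which avoids the wide intermediate frame and the `dropna` step. Note that this is a different technique from the melt chain above, shown only as a hedged sketch on two sample rows from the question; the reduced column set and the name `daily` are assumptions made for brevity.

```python
import pandas as pd

# Two sample rows from the question, with only the columns needed here.
df = pd.DataFrame({
    'Subbrand': ['S01', 'S02'],
    'Start Date': pd.to_datetime(['2016-06-16', '2017-01-01']),
    'End Date': pd.to_datetime(['2016-08-19', '2017-01-27']),
    'Expense': [10943244, 9614901],
})

df['Duration'] = (df['End Date'] - df['Start Date']).dt.days + 1
df['Daily Expense'] = df['Expense'] / df['Duration']

# Store one list of dates per row; explode then emits one row per date.
df['Daily_Date'] = pd.Series(
    [list(pd.date_range(start, end, freq='D'))
     for start, end in zip(df['Start Date'], df['End Date'])],
    index=df.index,
)
daily = df.explode('Daily_Date').reset_index(drop=True)
# explode leaves an object column, so convert it back to datetime64.
daily['Daily_Date'] = pd.to_datetime(daily['Daily_Date'])
print(daily.tail())
```

Unlike the melt version, no `variable` helper column is produced, and the rows stay grouped by their source period.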