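>I have a df containing expense and invoice values. Some jobs actually have two job #s: an old one and a new one. For those jobs I need to sum the expenses and invoice values recorded under both numbers so the result is a single row, and then delete the rows that were entered under the new job #.
I have an Excel file that lists the expense and invoice totals for each job. In my code this is crcy: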
Job# Expenses Invoice
1 5 2
2 10 27
3 15 33
10 60 4
20 57 21
12 9 36
22 11 18
Then I have an Excel file that lists the old and new job #s:
Job# New Job#
10 20
11 21
12 22
I'm not sure which pandas operation to use here, so I don't know what to try. Any suggestions are much appreciated.
import pandas as pd
# Pull in excel data
crcy = pd.read_excel('crcy1.xlsx')
jobs = pd.read_excel('jobs.xlsx')
# Merge on job#
df3 = crcy.merge(jobs, on='Job#', how='outer')
# Drop rows where new job # is pulled in but
# has no Expense/Invoice entries.
df3 = df3.dropna(thresh=3)
print(df3)
Actual result:
Job# Expenses Invoice New Job#
1 5.0 2.0 0.0
2 10.0 27.0 0.0
3 15.0 33.0 0.0
10 60.0 4.0 20.0
20 57.0 21.0 0.0
12 9.0 36.0 22.0
22 11.0 18.0 0.0
Desired result:
Job# Expenses Invoice New Job#
1 5.0 2.0 0.0
2 10.0 27.0 0.0
3 15.0 33.0 0.0
10 117.0 25.0 20.0
12 20.0 54.0 22.0
# left merge crcy with jobs on the shared Job# column so each old job row picks up its New Job#
df = crcy.merge(jobs, on='Job#', how='left')
print(df)
Job# Expenses Invoice New Job#
0 1 5 2 NaN
1 2 10 27 NaN
2 3 15 33 NaN
3 10 60 4 20.0
4 20 57 21 NaN
5 12 9 36 22.0
6 22 11 18 NaN
# then left merge with jobs again but this time match Job# on the left with New Job# on the right
df = df.merge(jobs, left_on='Job#', right_on='New Job#', how='left')
print(df)
Job#_x Expenses Invoice New Job#_x Job#_y New Job#_y
0 1 5 2 NaN NaN NaN
1 2 10 27 NaN NaN NaN
2 3 15 33 NaN NaN NaN
3 10 60 4 20.0 NaN NaN
4 20 57 21 NaN 10.0 20.0
5 12 9 36 22.0 NaN NaN
6 22 11 18 NaN 12.0 22.0
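# fill the missing Job#_y with Job#_x
# you'll use this for grouping on the old job # and summing its expenses and invoices
# (assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas)
df['Job#_y'] = df['Job#_y'].fillna(df['Job#_x'])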
print(df)
Job#_x Expenses Invoice New Job#_x Job#_y New Job#_y
0 1 5 2 NaN 1.0 NaN
1 2 10 27 NaN 2.0 NaN
2 3 15 33 NaN 3.0 NaN
3 10 60 4 20.0 10.0 NaN
4 20 57 21 NaN 10.0 20.0
5 12 9 36 22.0 12.0 NaN
6 22 11 18 NaN 12.0 22.0
# do the same with New Job#_x
df['New Job#_x'] = df['New Job#_x'].fillna(df['Job#_x'])
print(df)
Job#_x Expenses Invoice New Job#_x Job#_y New Job#_y
0 1 5 2 1.0 1.0 NaN
1 2 10 27 2.0 2.0 NaN
2 3 15 33 3.0 3.0 NaN
3 10 60 4 20.0 10.0 NaN
4 20 57 21 20.0 10.0 20.0
5 12 9 36 22.0 12.0 NaN
6 22 11 18 22.0 12.0 22.0
# group on Job#_y and New Job#_x, since they have no nulls, and sum the expenses and invoices
# then reset the index
df = df.groupby([
'Job#_y',
'New Job#_x',
])[[
'Expenses',
'Invoice',
]].sum().reset_index()
print(df)
Job#_y New Job#_x Expenses Invoice
0 1.0 1.0 5 2
1 2.0 2.0 10 27
2 3.0 3.0 15 33
3 10.0 20.0 117 25
4 12.0 22.0 20 54
# strip the _x/_y merge suffixes from the column names using .str.replace with regex=True
df.columns = df.columns.str.replace(r'_[a-z]', '', regex=True)
print(df)
Job# New Job# Expenses Invoice
0 1.0 1.0 5 2
1 2.0 2.0 10 27
2 3.0 3.0 15 33
3 10.0 20.0 117 25
4 12.0 22.0 20 54
# set New Job# to 0 where it equals the old Job# (i.e. the job was never renumbered)
df['New Job#'] = df['New Job#'].mask(df['Job#'] == df['New Job#'], 0)
print(df)
Job# New Job# Expenses Invoice
0 1.0 0.0 5 2
1 2.0 0.0 10 27
2 3.0 0.0 15 33
3 10.0 20.0 117 25
4 12.0 22.0 20 54
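For comparison, here is a more compact sketch of the same idea. It is not the approach walked through above: instead of two merges, it relabels every row entered under a new job # with the old job # and then groups. The sample frames are typed in from the question so the snippet runs without the Excel files, and the names `new_to_old`, `old_to_new`, `relabelled` and `totals` are made up for illustration. It assumes no job # appears as both an old and a new number.

```python
import pandas as pd

# Sample data copied from the question (stands in for the two Excel files)
crcy = pd.DataFrame({
    'Job#': [1, 2, 3, 10, 20, 12, 22],
    'Expenses': [5, 10, 15, 60, 57, 9, 11],
    'Invoice': [2, 27, 33, 4, 21, 36, 18],
})
jobs = pd.DataFrame({'Job#': [10, 11, 12], 'New Job#': [20, 21, 22]})

# Lookup tables in both directions (illustrative names)
new_to_old = dict(zip(jobs['New Job#'], jobs['Job#']))
old_to_new = dict(zip(jobs['Job#'], jobs['New Job#']))

# Relabel rows recorded under a new job # with the old job #
relabelled = crcy.copy()
relabelled['Job#'] = relabelled['Job#'].replace(new_to_old)

# One row per (old) job # with summed expenses and invoices
totals = relabelled.groupby('Job#', as_index=False)[['Expenses', 'Invoice']].sum()

# Attach the new job #, using 0 where the job was never renumbered
totals['New Job#'] = totals['Job#'].map(old_to_new).fillna(0).astype(int)
print(totals)
```

Because the job numbers are never merged against a column containing NaN, they stay integers here rather than becoming floats.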
Try this:
# Rename the columns for easier reference
jobs.columns = ['Old Job#', 'New Job#']
# For each job, find if it has an old Job#
crcy = crcy.merge(jobs, left_on='Job#', right_on='New Job#', how='left')
# The Job# that goes into the report is the Old Job#, if it has that
crcy['Report Job#'] = crcy['Old Job#'].combine_first(crcy['Job#'])
crcy.groupby('Report Job#').agg({
'Expenses': 'sum',
'Invoice': 'sum',
'Old Job#': 'first'
})
Result:
Expenses Invoice Old Job#
Report Job#
1.0 5 2 NaN
2.0 10 27 NaN
3.0 15 33 NaN
10.0 117 25 10.0
12.0 20 54 12.0
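The result above is indexed by Report Job# and carries Old Job# rather than the New Job# column shown in the question's desired output. A possible variation, sketched below with the question's sample data typed in by hand, aggregates New Job# instead and reshapes the frame. The names `merged` and `report` are illustrative, and filling missing values with 0 is an assumption based on the desired output in the question.

```python
import pandas as pd

# Sample data copied from the question (stands in for the two Excel files)
crcy = pd.DataFrame({
    'Job#': [1, 2, 3, 10, 20, 12, 22],
    'Expenses': [5, 10, 15, 60, 57, 9, 11],
    'Invoice': [2, 27, 33, 4, 21, 36, 18],
})
jobs = pd.DataFrame({'Old Job#': [10, 11, 12], 'New Job#': [20, 21, 22]})

# For each row, look up whether its Job# is the new number of an older job
merged = crcy.merge(jobs, left_on='Job#', right_on='New Job#', how='left')

# Report under the old job # when there is one, otherwise under the row's own Job#
merged['Report Job#'] = merged['Old Job#'].combine_first(merged['Job#'])

report = (
    merged.groupby('Report Job#')
          .agg({'Expenses': 'sum', 'Invoice': 'sum', 'New Job#': 'first'})
          .fillna({'New Job#': 0})           # jobs that were never renumbered
          .reset_index()
          .rename(columns={'Report Job#': 'Job#'})
          .astype({'Job#': int, 'New Job#': int})
)
print(report)
```

The `'first'` aggregation skips missing values, so each renumbered job picks up the new number from the row that was entered under it.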