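I have two Excel files, and I want to use Python (or even Excel itself) to perform a VLOOKUP and find the cost differences.
My files look like this:
source_data.xlsx contains distance ranges and their prices. For example, a distance from 1 to 100 is charged 4800, and a distance from 101 to 120 is charged 5100.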
DISTANCE COST
1-100 4800
101-120 5100
121-140 5500
141-160 5900
161-180 6200
181-200 6600
210-220 6900
221-240 7200
Analysis.xlsx
loading_station distance_travel total_cost status
PUGU 40 4000 PAID
PUGU 80 3200 PAID
MOROGORO 50 5000 PAID
MOROGORO 220 30400 PAID
DODOMA 150 5100 PAID
KIGOMA 90 2345 PAID
DODOMA 230 6000 PAID
DODOMA 180 16500 PAID
KIGOMA 32 3000 PAID
DODOMA 45 6000 PAID
DODOMA 65 5000 PAID
KIGOMA 77 1000 PAID
KIGOMA 90 4000 PAID
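The actual cost for each distance is given in source_data.xlsx. I want to check whether the costs in Analysis.xlsx match the actual values, and I want to detect underpayments and overpayments.
The desired output should look like this, with two added columns: source_cost, which is pulled from source_data.xlsx with a VLOOKUP, and Difference, which is the difference between total_cost and source_cost.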
loading_station distance_travel total_cost status source_cost Difference
PUGU 40 4000 PAID 4800 -800
PUGU 80 3200 PAID 4800 -1600
MOROGORO 50 5000 PAID 4800 200
MOROGORO 220 30400 PAID 6900 23500
DODOMA 150 5100 PAID 5900 -800
KIGOMA 90 2345 PAID 4800 -2455
DODOMA 230 6000 PAID 7200 -1200
DODOMA 180 16500 PAID 6200 10300
KIGOMA 32 3000 PAID 4800 -1800
DODOMA 45 6000 PAID 4800 1200
DODOMA 65 5000 PAID 4800 200
KIGOMA 77 1000 PAID 4800 -3800
KIGOMA 90 4000 PAID 4800 -800
My code so far:
# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('Analysis.xlsx')
source_data.head(5)
analysis_file.head(5)
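Since this is a binning problem (each distance has to be assigned to a range), I suggest using cut() to find the corresponding values.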
import pandas as pd
# load both workbooks (file names as given in the question)
df_source = pd.read_excel('source_data.xlsx')
df_analysis = pd.read_excel('Analysis.xlsx')
# create bins: split each "low-high" range into integer lower and upper bounds
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As shown, the result is an IntervalIndex with dtype='interval[int64, both]', so both endpoints of each range are included.
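To make the effect of closed='both' concrete, here is a minimal sketch. The three intervals are copied from rows of the source table, and the probe values (including 205, which falls in the gap between 200 and 210) are my own picks for illustration. It uses IntervalIndex.get_indexer, which reports -1 for a value that no interval contains.

```python
import pandas as pd

# Three of the ranges from source_data.xlsx, closed on both ends.
bins = pd.IntervalIndex.from_arrays([1, 101, 210], [100, 120, 220], closed='both')

# Position of the interval containing each value; -1 means no interval matches.
positions = bins.get_indexer([1, 100, 101, 205, 220])
print(positions.tolist())  # → [0, 0, 1, -1, 2]
```

Both 1 and 100 land in the first interval because the endpoints are included, while 205 matches nothing.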
# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
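loading_station | distance_travel | total_cost | status | source_cost | Difference
---|---|---|---|---|---
PUGU | 40 | 4000 | PAID | 4800 | -800
PUGU | 80 | 3200 | PAID | 4800 | -1600
MOROGORO | 50 | 5000 | PAID | 4800 | 200
MOROGORO | 220 | 30400 | PAID | 6900 | 23500
DODOMA | 150 | 5100 | PAID | 5900 | -800
KIGOMA | 90 | 2345 | PAID | 4800 | -2455
DODOMA | 230 | 6000 | PAID | 7200 | -1200
DODOMA | 180 | 16500 | PAID | 6200 | 10300
KIGOMA | 32 | 3000 | PAID | 4800 | -1800
DODOMA | 45 | 6000 | PAID | 4800 | 1200
DODOMA | 65 | 5000 | PAID | 4800 | 200
KIGOMA | 77 | 1000 | PAID | 4800 | -3800
KIGOMA | 90 | 4000 | PAID | 4800 | -800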
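The question also asks to detect underpayments and overpayments, and the source table has a gap between 200 and 210. A rough sketch of one way to handle both is below. It builds small DataFrames inline instead of reading the Excel files. The payment_check column name, its labels, and the KIGOMA row with distance 205 are assumptions of mine for illustration, not part of the original data. It looks up positions with get_indexer rather than cut(), so a distance outside every range becomes NaN instead of raising when casting to int.

```python
import numpy as np
import pandas as pd

# Ranges and costs copied from source_data.xlsx in the question.
df_source = pd.DataFrame({
    'DISTANCE': ['1-100', '101-120', '121-140', '141-160',
                 '161-180', '181-200', '210-220', '221-240'],
    'COST': [4800, 5100, 5500, 5900, 6200, 6600, 6900, 7200],
})
# Three rows from Analysis.xlsx plus one hypothetical row (distance 205)
# that falls in the 201-209 gap of the source table.
df_analysis = pd.DataFrame({
    'loading_station': ['PUGU', 'MOROGORO', 'DODOMA', 'KIGOMA'],
    'distance_travel': [40, 220, 150, 205],
    'total_cost': [4000, 30400, 5100, 6600],
})

# Split "low-high" strings into integer lower and upper bounds.
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed='both')

# Position of the matching range per row; -1 where no range covers the distance.
pos = bins.get_indexer(df_analysis['distance_travel'])
costs = df_source['COST'].to_numpy()
df_analysis['source_cost'] = np.where(pos >= 0, costs[pos], np.nan)

# Negative difference means underpaid, positive means overpaid.
diff = df_analysis['total_cost'] - df_analysis['source_cost']
df_analysis['Difference'] = diff
df_analysis['payment_check'] = np.select(
    [diff.isna().to_numpy(), (diff < 0).to_numpy(), (diff > 0).to_numpy()],
    ['no matching range', 'underpaid', 'overpaid'],
    default='exact',
)
print(df_analysis)
```

Because of the NaN for unmatched rows, source_cost and Difference come out as floats here; rows flagged 'no matching range' can be reviewed by hand or the source table extended to cover the gap.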