Vlookup with Python when the lookup data is given as ranges



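I have two Excel files, and I want to use Python (or even Excel) to perform a vlookup and find the cost differences.

My files look like this:

source_data.xlsx contains distance ranges and their prices. For example, a distance from 1 to 100 should be charged 4800, and a distance from 101 to 120 should be charged 5100.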

DISTANCE     COST
1-100        4800
101-120      5100
121-140      5500
141-160      5900
161-180      6200
181-200      6600
210-220      6900
221-240      7200

Analysis.xlsx

loading_station  distance_travel     total_cost    status
PUGU                  40                4000       PAID
PUGU                  80                3200       PAID
MOROGORO              50                5000       PAID
MOROGORO              220               30400      PAID
DODOMA                150               5100       PAID
KIGOMA                90                2345       PAID
DODOMA                230               6000       PAID
DODOMA                180               16500      PAID
KIGOMA                32                3000       PAID
DODOMA                45                6000       PAID
DODOMA                65                5000       PAID
KIGOMA                77                1000       PAID
KIGOMA                90                4000       PAID

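The actual cost for each distance is given in source_data.xlsx. I want to check whether the costs in Analysis.xlsx match those actual values, and I want to detect underpayments and overpayments.

The desired output should look like this, with two columns added: source_cost, which is looked up from source_data.xlsx vlookup-style, and Difference, which is total_cost minus source_cost.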

loading_station distance_travel total_cost  status  source_cost Difference
PUGU               40                4000     PAID     4800        -800
PUGU               80                3200     PAID     4800        -1600
MOROGORO           50                5000     PAID     4800         200
MOROGORO           220               30400    PAID     6900         23500
DODOMA             150               5100     PAID     5900         -800
KIGOMA             90                2345     PAID     4800         -2455
DODOMA             230               6000     PAID     7200         -1200
DODOMA             180               16500    PAID     6200          10300
KIGOMA             32                3000     PAID     4800          -1800
DODOMA             45                6000     PAID     4800           1200
DODOMA             65                5000     PAID     4800           200
KIGOMA             77                1000     PAID     4800           -3800
KIGOMA             90                4000     PAID     4800           -800

My code so far:

# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)

Since this is a binning problem, I suggest using pd.cut() to find the corresponding value for each distance.

import pandas as pd
# df_source and df_analysis are the two sheets from the question
df_source = pd.read_excel('source_data.xlsx')
df_analysis = pd.read_excel('analysis.xlsx')
# create bins: split strings such as "1-100" into integer lower and upper bounds
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')

As shown, the IntervalIndex has dtype='interval[int64, both]', meaning both endpoints of each bin are included.
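To make the effect of closed='both' concrete, here is a minimal sketch using only the first two bins from the question. The probe values (1, 100, 101, 120, 121) are my own choice for illustration. IntervalIndex.get_indexer returns the position of the bin that contains each value, or -1 when no bin contains it.

```python
import pandas as pd

# First two bins from the question, with both endpoints included
bins = pd.IntervalIndex.from_arrays([1, 101], [100, 120], closed='both')

# Position of the containing bin for each probe value, or -1 if none matches
positions = bins.get_indexer([1, 100, 101, 120, 121])
print(positions)
```

The boundary values 100 and 120 land in their own bins, while 121 lies outside both bins and is reported as -1.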


# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
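loading_station distance_travel total_cost  status  source_cost Difference
PUGU               40                4000     PAID     4800        -800
PUGU               80                3200     PAID     4800        -1600
MOROGORO           50                5000     PAID     4800         200
MOROGORO           220               30400    PAID     6900         23500
DODOMA             150               5100     PAID     5900         -800
KIGOMA             90                2345     PAID     4800         -2455
DODOMA             230               6000     PAID     7200         -1200
DODOMA             180               16500    PAID     6200         10300
KIGOMA             32                3000     PAID     4800         -1800
DODOMA             45                6000     PAID     4800         1200
DODOMA             65                5000     PAID     4800         200
KIGOMA             77                1000     PAID     4800         -3800
KIGOMA             90                4000     PAID     4800         -800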
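One caveat: the source table has a gap between 200 and 210, so a distance such as 205 would match no bin, and the .astype(int) step above would then fail on the resulting missing value. The following is a rough sketch of one way to tolerate that, not part of the original answer. It uses a few rows from the question plus a made-up row with distance 205, and the payment_check column name and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-ins for the two spreadsheets: values from the question, plus one
# made-up distance (205) that falls in the gap between 200 and 210
df_source = pd.DataFrame({
    'DISTANCE': ['1-100', '101-120', '181-200', '210-220'],
    'COST': [4800, 5100, 6600, 6900],
})
df_analysis = pd.DataFrame({
    'distance_travel': [40, 220, 205],
    'total_cost': [4000, 30400, 7000],
})

# Build inclusive bins from the "low-high" strings
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(
    bounds[0].to_numpy(), bounds[1].to_numpy(), closed='both'
)

# Position of the matching bin per row, or -1 when the distance is not covered
pos = bins.get_indexer(df_analysis['distance_travel'].to_numpy())
costs = df_source['COST'].to_numpy()

# Leave source_cost as NaN for uncovered distances instead of raising
df_analysis['source_cost'] = np.where(pos >= 0, costs[pos], np.nan)
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']

# Hypothetical label column: negative difference = underpaid, positive = overpaid
diff = df_analysis['Difference']
df_analysis['payment_check'] = np.select(
    [diff < 0, diff > 0, diff == 0],
    ['underpaid', 'overpaid', 'exact'],
    default='no matching range',
)
print(df_analysis)
```

Rows whose distance is not covered by any range keep a missing source_cost and are labelled separately, so they can be reviewed by hand rather than silently dropped.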
