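I've started using method chaining to improve the readability of my Pandas code. However, I'm struggling to find a way to assign a new calculated column based on criteria applied to the values in three other columns.
What I want to do is calculate transit days for different countries, carriers and carrier services, based on the shipped date and the delivery attempt date. The market, carrier and service (the criteria) determine which weekdays count between the shipped date and the delivery date (the calculation).
Without method chaining, I would use the code below to calculate transit days: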
df.loc[(criteria1), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111100')
df.loc[(criteria2), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111110')
df.loc[(criteria3), 'transit_days'] = np.busday_count(shipped_date, attempt_date, weekmask='1111111')
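To make the three weekmasks concrete, here is a minimal sketch with made-up dates (a Friday shipment and a first attempt on the following Monday). The dates and variable names are assumptions for illustration, not part of the real data. `np.busday_count` counts the valid days from the start date up to, but not including, the end date.

```python
import numpy as np

# Hypothetical dates: shipped Friday 2023-06-02, first attempt Monday 2023-06-05.
shipped_date = np.array(["2023-06-02"], dtype="datetime64[D]")
attempt_date = np.array(["2023-06-05"], dtype="datetime64[D]")

# The weekmask string runs Monday..Sunday; '1' marks a day that counts.
mon_fri = np.busday_count(shipped_date, attempt_date, weekmask="1111100")
mon_sat = np.busday_count(shipped_date, attempt_date, weekmask="1111110")
all_days = np.busday_count(shipped_date, attempt_date, weekmask="1111111")

print(mon_fri, mon_sat, all_days)  # [1] [2] [3]
```

The interval covers Friday, Saturday and Sunday, so each extra weekend day enabled in the mask adds one transit day.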
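With method chaining, the only solution I have come up with is to use the `.apply` method with a custom function built from if/else statements, which is not ideal and rather slow when calculating millions of rows. Any suggestions?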
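For reference, the slow row-wise approach would look roughly like the sketch below, assuming it means `DataFrame.apply` with `axis=1`. The carrier names, their weekmasks and the sample rows are hypothetical. It calls `np.busday_count` once per row, which is why it scales poorly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every order ships on a Friday, first attempt on Monday.
df = pd.DataFrame({
    "carrier": ["Carrier_1", "Carrier_2", "Carrier_3"],
    "shipped_time": pd.to_datetime(["2023-06-02"] * 3),
    "first_attempt": pd.to_datetime(["2023-06-05"] * 3),
})

def transit_days_for_row(row):
    # Pick the weekmask with plain if/else (assumed mapping, for illustration only).
    if row["carrier"] == "Carrier_1":
        weekmask = "1111111"
    elif row["carrier"] == "Carrier_2":
        weekmask = "1111110"
    else:
        weekmask = "1111100"
    start = row["shipped_time"].to_datetime64().astype("datetime64[D]")
    end = row["first_attempt"].to_datetime64().astype("datetime64[D]")
    return int(np.busday_count(start, end, weekmask=weekmask))

# One Python-level function call per row: fine for small frames, slow for millions.
result = df.assign(transit_days=lambda x: x.apply(transit_days_for_row, axis=1))
```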
Edit: adding the code to show my progress so far:
df = (pd
    .read_csv(filepath_or_buffer='data.csv')
    # Reformatting the column names to lowercase and replacing spaces with underscores
    .rename(columns=clean_column_names)
    # Filtering results to include only express orders
    .query('service_level == "Express"')
    # Dropping rows without delivery attempt
    .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date'], how='all')
    # Assigning new columns
    # 1. first_attempt = defining the first attempted delivery date
    # 2. transit_days = specific transit days for market
    .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'final_delivery_date']].min(axis=1).astype('datetime64[D]'),
            transit_days = [STUCK HERE])
)
df
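EDIT2: I think I found a way to do this by defining a function that uses the loc operator and passing the DataFrame to it through a lambda.
First, create the function: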
import numpy as np

def calculate_transit_days(df):
    # Work on a copy so the DataFrame passed in by the chain is not mutated
    df = df.copy()
    mask1 = (df['carrier'] == 'Carrier_1')
    mask2 = (df['carrier'] == 'Carrier_2')
    # Carrier_1 counts every day of the week
    df.loc[mask1, 'new_column'] = np.busday_count(
        df.loc[mask1, 'shipped_time'].values.astype('datetime64[D]'),
        df.loc[mask1, 'first_attempt'].values.astype('datetime64[D]'),
        weekmask='1111111')
    # Carrier_2 counts Monday to Saturday
    df.loc[mask2, 'new_column'] = np.busday_count(
        df.loc[mask2, 'shipped_time'].values.astype('datetime64[D]'),
        df.loc[mask2, 'first_attempt'].values.astype('datetime64[D]'),
        weekmask='1111110')
    return df['new_column']
Then call the function from a lambda inside assign:
df = (pd
    .read_csv(filepath_or_buffer='data.csv')
    # Reformatting the column names to lowercase and replacing spaces with underscores
    .rename(columns=clean_column_names)
    # Filtering results to include only express orders
    .query('customer_level_of_service == "Express"')
    # Dropping rows without delivery attempt
    .dropna(subset=['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date'], how='all')
    # Finding the first attempt date
    .assign(first_attempt = lambda x: x[['initial_delivery_attempt_date', 'available_for_pickup_time', 'delivery_date']].min(axis=1),
            transit_days = lambda x: calculate_transit_days(x))
)
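The same idea can be tried end to end without the CSV. This is a minimal sketch, not the full pipeline: the in-memory frame, its dates, and the carrier-to-weekmask mapping are made up for illustration. It also relies on two `assign` behaviours: a callable can be passed directly (so the lambda wrapper is optional), and later keyword arguments can see columns created by earlier ones in the same call.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from carrier to weekmask; rows for other carriers stay NaN.
WEEKMASKS = {"Carrier_1": "1111111", "Carrier_2": "1111110"}

def calculate_transit_days(df):
    days = pd.Series(np.nan, index=df.index, dtype="float64")
    for carrier, weekmask in WEEKMASKS.items():
        mask = df["carrier"] == carrier
        days.loc[mask] = np.busday_count(
            df.loc[mask, "shipped_time"].values.astype("datetime64[D]"),
            df.loc[mask, "first_attempt"].values.astype("datetime64[D]"),
            weekmask=weekmask,
        )
    return days

# Made-up stand-in for the data read from data.csv.
raw = pd.DataFrame({
    "carrier": ["Carrier_1", "Carrier_2"],
    "shipped_time": pd.to_datetime(["2023-06-02", "2023-06-02"]),
    "initial_delivery_attempt_date": pd.to_datetime(["2023-06-05", "2023-06-06"]),
    "delivery_date": pd.to_datetime(["2023-06-06", "2023-06-05"]),
})

result = raw.assign(
    # Earliest of the candidate dates is the first attempt.
    first_attempt=lambda x: x[["initial_delivery_attempt_date", "delivery_date"]].min(axis=1),
    # assign accepts the function itself; it receives the frame including first_attempt.
    transit_days=calculate_transit_days,
)
```

Building a separate Series instead of copying the whole DataFrame keeps the helper cheap, since `assign` only needs the returned column.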
EDIT3:
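Also, instead of using .loc and masks, I noticed that np.select is useful here. First define all the conditions for np.select (condlist) as a list covering the weekmasks '1111110' and '1111111', then use the np.select choicelist to run the matching busday_count call, with weekmask '1111100' as the default.
In the example below I list two simplified conditions and two np.busday_count calculations, so both lists have length 2. The default weekmask is then Monday to Friday ('1111100').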
def calculate_transit_days(df):
    df = df.copy()
    # Defining conditions into a list
    conditions = [
        # Conditions 1
        ((df['carrier'] == 'Carrier_1') | (df['shipped_time'].dt.weekday == 6)) |
        ((df['carrier'] == 'Carrier_2') | (df['shipped_time'].dt.weekday == 6)),
        # Conditions 2
        ((df['carrier'] == 'Carrier_3'))
    ]
    # Defining calculation for conditions
    choices = [
        # Choice for condition 1
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111111')),
        # Choice for condition 2
        (np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111110'))
    ]
    df['new_column'] = np.select(
        condlist = conditions,
        choicelist = choices,
        default = np.busday_count(df['shipped_time'].values.astype('datetime64[D]'), df['first_attempt'].values.astype('datetime64[D]'), weekmask='1111100')
    )
    return df['new_column']
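A trimmed-down version of the np.select approach can be checked on toy data. In this sketch the conditions are reduced to a plain carrier match and the sample rows are invented, so it only illustrates the mechanics. As in the function above, an array is passed as `default`; the NumPy docs describe `default` as a scalar, but a same-length array broadcasts against the choices.

```python
import numpy as np
import pandas as pd

def transit_days_select(df):
    # Convert once and reuse for every weekmask.
    start = df["shipped_time"].values.astype("datetime64[D]")
    end = df["first_attempt"].values.astype("datetime64[D]")
    conditions = [
        (df["carrier"] == "Carrier_1").to_numpy(),  # counts all seven days
        (df["carrier"] == "Carrier_3").to_numpy(),  # counts Monday to Saturday
    ]
    choices = [
        np.busday_count(start, end, weekmask="1111111"),
        np.busday_count(start, end, weekmask="1111110"),
    ]
    # Everything else falls back to Monday to Friday.
    default = np.busday_count(start, end, weekmask="1111100")
    return np.select(conditions, choices, default=default)

# Hypothetical rows: shipped Friday 2023-06-02, first attempt Monday 2023-06-05.
sample = pd.DataFrame({
    "carrier": ["Carrier_1", "Carrier_2", "Carrier_3"],
    "shipped_time": pd.to_datetime(["2023-06-02"] * 3),
    "first_attempt": pd.to_datetime(["2023-06-05"] * 3),
})

result = sample.assign(transit_days=transit_days_select)
```

All three counts are computed for every row and np.select only picks between them, so the work stays vectorized at the cost of some redundant counting.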
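Nothing that comes to mind is cleaner than the three fairly clear statements you already have.
For example, you could compute all three variants and multiply each one by its condition. This works when the criteria are mutually exclusive boolean masks, because True and False act as 1 and 0 in the multiplication.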
df['transit_days'] = (
    criteria1 * np.busday_count(shipped_date, attempt_date, weekmask='1111100') +
    criteria2 * np.busday_count(shipped_date, attempt_date, weekmask='1111110') +
    criteria3 * np.busday_count(shipped_date, attempt_date, weekmask='1111111')
)
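Here is a small runnable sketch of the multiply-by-mask idea. The carrier names, the criteria definitions and the dates are hypothetical, and the masks are assumed to be mutually exclusive and to cover every row; a row matching two masks would get the sum of two counts, and a row matching none would get 0.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: shipped Friday 2023-06-02, first attempt Monday 2023-06-05.
df = pd.DataFrame({
    "carrier": ["Carrier_A", "Carrier_B", "Carrier_C"],
    "shipped_time": pd.to_datetime(["2023-06-02"] * 3),
    "first_attempt": pd.to_datetime(["2023-06-05"] * 3),
})

shipped_date = df["shipped_time"].values.astype("datetime64[D]")
attempt_date = df["first_attempt"].values.astype("datetime64[D]")

# Assumed criteria: exactly one of these is True for each row.
criteria1 = (df["carrier"] == "Carrier_A").to_numpy()
criteria2 = (df["carrier"] == "Carrier_B").to_numpy()
criteria3 = (df["carrier"] == "Carrier_C").to_numpy()

# Booleans act as 0/1, so each row keeps only the count for its own weekmask.
df["transit_days"] = (
    criteria1 * np.busday_count(shipped_date, attempt_date, weekmask="1111100") +
    criteria2 * np.busday_count(shipped_date, attempt_date, weekmask="1111110") +
    criteria3 * np.busday_count(shipped_date, attempt_date, weekmask="1111111")
)
```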