我的测试数据帧:
print("Create List", end='n')
Test_Data = [('display_name', ['A', 'B', 'B','C','C','C','C','C',]),
('security_type1', ['GOVT', 'CORP','CORP','CORP','CORP','CORP','CORP','CORP']),
('currency_str', ['USD', 'NZD','USD','EUR','EUR','GBP','GBP','USD']),
('state', ['Done','Passed','Done','Done','Traded Away','Done','Done','Done']),
('rfq_qty_CAD_Equiv', [100000, 100000, 100000,100000,100000,100000,100000,100000]),
]
dfTest_Data = pd.DataFrame.from_items(Test_Data)
display(dfTest_Data)
display_name security_type1 currency_str state rfq_qty_CAD_Equiv
A GOVT USD Done 100000
B CORP NZD Passed 100000
B CORP USD Done 100000
C CORP EUR Done 100000
C CORP EUR Traded Away 100000
C CORP GBP Done 100000
C CORP GBP Done 100000
C CORP USD Done 100000
我想要的输出如下。这里的驱动程序按display_name
、security_type1
和currency_str
分组。Total_RFQ
和Total_RFQ_Volume
是关于display_name
display_name security_type1 currency_str Done_RFQ Not_Done_RFQ Total_RFQ
A GOVT USD 1 0 1
B CORP USD 1 1 2
C CORP EUR 1 1 5
C CORP GBP 2 0 5
C CORP USD 1 0 5
Hit_Rate Done_RFQ_Volume Not_Done_RFQ_Volume Total_RFQ_Volume
1.00 100000 0 100000
0.50 100000 100000 200000
0.20 100000 100000 500000
0.40 200000 0 500000
0.20 100000 0 500000
Volume_per_Done_RFQ Volume_per_Not_Done_RFQ Volume_per_Total_RFQ
100000 0 100000
100000 100000 100000
100000 100000 100000
100000 0 100000
100000 0 100000
Hit_Rate
=Done_RFQ
/Total_RFQ
Volume_per_Done_RFQ
=Done_RFQ_Volume
/Done_RFQ
Volume_per_Not_Done_RFQ
=Not_Done_RFQ_Volume
/Not_Done_RFQ
Volume_per_Total_RFQ
=Total_RFQ_Volume
/Total_RFQ
大部分工作已经完成,我只是在合并第三个数据帧和显示需要零条目的Not_Done行项目时遇到问题
print("All Trades", end='n')
d = [
('Total_RFQ_Volume', 'sum'),
('Total_RFQ', 'size'),
]
df1 = dfTest_Data.groupby(['display_name'])['rfq_qty_CAD_Equiv'].agg(d)
display (df1)
print("Done Trades", end='n')
d = [
('Done_RFQ_Volume', 'sum'),
('Done_RFQ', 'size'),
]
mask = dfTest_Data['state'].str.contains('Done')
df2 = dfTest_Data[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv'].agg(d).reset_index()
display (df2)
print("Not Done Trades", end='n')
d = [
('Not_Done_RFQ_Volume', 'sum'),
('Not_Done_RFQ', 'size'),
]
mask = ~dfTest_Data['state'].str.contains('Done')
df3 = dfTest_Data[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv'] .agg(d) .reset_index()
display (df3)
print("Join Done trades on All Trades", end='n')
df_Done_Client_Hit_Rate_Volume = df2.join(df1, on='display_name').join(df3, on='display_name')
# Create additional calculated columns
df_Done_Client_Hit_Rate_Volume['Hit_Rate'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Done_RFQ'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Done_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Not_Done_RFQ'] = df_Done_Client_Hit_Rate_Volume['Not_Done_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Not_Done_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Total_RFQ'] = df_Done_Client_Hit_Rate_Volume['Total_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ']
# Reorder columns
df_Done_Client_Hit_Rate_Volume = df_Done_Client_Hit_Rate_Volume[['display_name',
'security_type1',
'currency_str',
'Done_RFQ',
'Not_Done_RFQ',
'Total_RFQ',
'Hit_Rate',
'Done_RFQ_Volume',
'Not_Done_RFQ_Volume',
'Volume_per_Done_RFQ',
'Volume_per_Not_Done_RFQ',
'Total_RFQ_Volume'
'Volume_per_Total_RFQ'
]]
display (df_Done_Client_Hit_Rate_Volume)
我认为首先需要删除.reset_index
以进行df2
和df3
:
d = [
('Done_RFQ_Volume', 'sum'),
('Done_RFQ', 'size'),
]
mask = dfTest_Data['state'].str.contains('Done')
df2 = dfTest_Data[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv'].agg(d)
#print (df2)
print("Not Done Trades", end='n')
d = [
('Not_Done_RFQ_Volume', 'sum'),
('Not_Done_RFQ', 'size'),
]
mask = ~dfTest_Data['state'].str.contains('Done')
df3 = dfTest_Data[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv'].agg(d)
然后通过concat
和join
将DataFrame
连接在一起:
df = pd.concat([df2, df3],axis=1).reset_index()
df_Done_Client_Hit_Rate_Volume = df.join(df1, on='display_name')
df_Done_Client_Hit_Rate_Volume['Hit_Rate'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Done_RFQ'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Done_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Not_Done_RFQ'] = df_Done_Client_Hit_Rate_Volume['Not_Done_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Not_Done_RFQ']
df_Done_Client_Hit_Rate_Volume['Volume_per_Total_RFQ'] = df_Done_Client_Hit_Rate_Volume['Total_RFQ_Volume'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ']
最后将缺失值替换为0
:
df_Done_Client_Hit_Rate_Volume = df_Done_Client_Hit_Rate_Volume.fillna(0)
print (df_Done_Client_Hit_Rate_Volume)
display_name security_type1 currency_str Done_RFQ_Volume Done_RFQ
0 A GOVT USD 100000.0 1.0
1 B CORP NZD 0.0 0.0
2 B CORP USD 100000.0 1.0
3 C CORP EUR 100000.0 1.0
4 C CORP GBP 200000.0 2.0
5 C CORP USD 100000.0 1.0
Not_Done_RFQ_Volume Not_Done_RFQ Total_RFQ_Volume Total_RFQ Hit_Rate
0 0.0 0.0 100000 1 1.0
1 100000.0 1.0 200000 2 0.0
2 0.0 0.0 200000 2 0.5
3 100000.0 1.0 500000 5 0.2
4 0.0 0.0 500000 5 0.4
5 0.0 0.0 500000 5 0.2
Volume_per_Done_RFQ Volume_per_Not_Done_RFQ Volume_per_Total_RFQ
0 100000.0 0.0 100000.0
1 0.0 100000.0 100000.0
2 100000.0 0.0 100000.0
3 100000.0 100000.0 100000.0
4 100000.0 0.0 100000.0
5 100000.0 0.0 100000.0