Counting occurrences of data across two Pandas DataFrames linked by an index



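I have two large DataFrames with road accident information (only an excerpt is shown below): df_veh holds the details of the vehicles, and df_ped holds the number of pedestrians involved in each accident. veh_type gives the type of vehicle involved in the accident (1 = bicycle, 2 = car, 3 = bus). The two frames are linked by acc_index, which identifies a unique accident.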

import pandas as pd

veh_data = {'acc_index':  ['001', '002', '002', '003', '003', '004', '005', '005', '006',
'006', '007', '007', '008', '008', '008', '009', '009', '009'],
'veh_type': ['1', '1', '2', '1', '1', '1', '2', '2', '2', '3', '1', '2', '1', '1',
'1', '1', '2', '2'] }
df_veh = pd.DataFrame(veh_data, columns = ['acc_index', 'veh_type'])
ped_data = {'acc_index':  ['001', '002', '003', '004', '005', '006', '007', '008', '009'],
'pedestrians': ['1', '2', '0', '1', '4', '3', '0', '1', '2'] }
df_ped = pd.DataFrame(ped_data, columns = ['acc_index', 'pedestrians'])

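What I want to do is count the number of accidents (each unique acc_index counted only once):

  1. between cars and bicycles (veh_type==1 and veh_type==2)
  2. between bicycles and pedestrians (veh_type==1 and pedestrians>=1)
  3. between cars and pedestrians (veh_type==2 and pedestrians>=1)
  4. between cars only (veh_type==2 for the same acc_index)
  5. between bicycles only (veh_type==1 for the same acc_index)
  6. between pedestrians only (pedestrians>=1 for the same acc_index)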

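I have tried several approaches, but they give different results, so I am confused. For example, I tried to count bicycle-pedestrian accidents like this: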

df_bikes = df_veh[df_veh['veh_type']==1].groupby('acc_index').sum().reset_index()
bike_ped = pd.merge(df_bikes, df_ped, how='outer', on='acc_index')
bike_ped[(bike_ped['veh_type']==1) & (bike_ped['pedestrians']>=1)].groupby(
'acc_index').sum().reset_index()[['acc_index', 'veh_type', 'pedestrians']]

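As another example, this is how I count accidents between cars and bicycles, thanks to a comment on this post. I believe this one, at least, is correct. I am trying to find the simplest way to do this (while also showing the rows that were counted).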

bike_car = df_veh[df_veh.groupby('acc_index')['veh_type'].
transform(lambda g: not({1, 2} - {*g}))][['acc_index', 'veh_type']]
len(bike_car.groupby(['acc_index']).size().reset_index())

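Consider reshaping the vehicle data with pivot_table, joining it to a groupby aggregate of the pedestrians, and then running the needed query() calls on the result, where each row is a distinct acc_index: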

veh_dict = {'1': 'bicycle', '2': 'car', '3': 'bus'}
# Count vehicles per accident and type, one row per acc_index
pvt_df = (df_veh.assign(val = 1)
.pivot_table(index = 'acc_index',
columns = 'veh_type',
values = 'val',
aggfunc='sum')
.set_axis([veh_dict[i] for i in list('123')],
axis = 'columns')
.join(df_ped.assign(pedestrians = lambda x: x['pedestrians'].astype('int'))
.groupby('acc_index')['pedestrians']
.sum()
.to_frame(),
how = 'outer'
)
# Missing vehicle types become 0 so that the "== 0" queries below can match
.fillna(0)
)
pvt_df
#            bicycle  car  bus  pedestrians
# acc_index
# 001            1.0  0.0  0.0            1
# 002            1.0  1.0  0.0            2
# 003            2.0  0.0  0.0            0
# 004            1.0  0.0  0.0            1
# 005            0.0  2.0  0.0            4
# 006            0.0  1.0  1.0            3
# 007            1.0  1.0  0.0            0
# 008            3.0  0.0  0.0            1
# 009            1.0  2.0  0.0            2
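
As an alternative to the pivot_table chain, a similar one-row-per-accident table can be sketched with pd.crosstab, which counts rows per (acc_index, veh_type) pair and fills absent combinations with 0. This is only a sketch on the sample data from the question; the names veh_counts, ped_counts and alt_df are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
veh_data = {'acc_index': ['001', '002', '002', '003', '003', '004', '005', '005', '006',
                          '006', '007', '007', '008', '008', '008', '009', '009', '009'],
            'veh_type': ['1', '1', '2', '1', '1', '1', '2', '2', '2', '3', '1', '2', '1', '1',
                         '1', '1', '2', '2']}
ped_data = {'acc_index': ['001', '002', '003', '004', '005', '006', '007', '008', '009'],
            'pedestrians': ['1', '2', '0', '1', '4', '3', '0', '1', '2']}
df_veh = pd.DataFrame(veh_data)
df_ped = pd.DataFrame(ped_data)

veh_dict = {'1': 'bicycle', '2': 'car', '3': 'bus'}

# Number of vehicles of each type per accident (0 where a type is absent)
veh_counts = (pd.crosstab(df_veh['acc_index'], df_veh['veh_type'])
                .rename(columns=veh_dict))

# Pedestrians per accident, converted from strings to integers first
ped_counts = (df_ped.assign(pedestrians=df_ped['pedestrians'].astype(int))
                    .groupby('acc_index')['pedestrians']
                    .sum())

# Hypothetical name: one row per acc_index, integer counts throughout
alt_df = veh_counts.join(ped_counts, how='outer').fillna(0).astype(int)
print(alt_df)
```

Because rename(columns=veh_dict) maps only the codes that actually occur, this variant does not assume that all three vehicle types are present in the data.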

Queries

# BIKES AND CARS
pvt_df.query('(bicycle >= 1) & (car >= 1)')
#            bicycle  car  bus  pedestrians
# acc_index
# 002            1.0  1.0  0.0            2
# 007            1.0  1.0  0.0            0
# 009            1.0  2.0  0.0            2
# BIKES AND PEDESTRIANS
pvt_df.query('(bicycle >= 1) & (pedestrians >= 1)')
#            bicycle  car  bus  pedestrians
# acc_index
# 001            1.0  0.0  0.0            1
# 002            1.0  1.0  0.0            2
# 004            1.0  0.0  0.0            1
# 008            3.0  0.0  0.0            1
# 009            1.0  2.0  0.0            2
# CARS AND PEDESTRIANS
pvt_df.query('(car >= 1) & (pedestrians >= 1)')
#            bicycle  car  bus  pedestrians
# acc_index
# 002            1.0  1.0  0.0            2
# 005            0.0  2.0  0.0            4
# 006            0.0  1.0  1.0            3
# 009            1.0  2.0  0.0            2
### ONLY CARS
pvt_df.query('(bicycle == 0) & (car >= 1) & (bus == 0) & (pedestrians == 0)')
# Empty DataFrame
# Columns: [bicycle, car, bus, pedestrians]
# Index: []
### ONLY BICYCLES
pvt_df.query('(bicycle >= 1) & (car == 0) & (bus == 0) & (pedestrians == 0)')
#            bicycle  car  bus  pedestrians
# acc_index
# 003            2.0  0.0  0.0            0
### ONLY PEDESTRIANS
pvt_df.query('(bicycle == 0) & (car == 0) & (bus == 0) & (pedestrians >= 1)')   
# Empty DataFrame
# Columns: [bicycle, car, bus, pedestrians]
# Index: []
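
Since the question asks for the number of accidents in each category, the queries can also be collected into a dictionary and counted with len(). The following is a minimal sketch on the sample data; the names table, conditions and counts are hypothetical, and the table is rebuilt with pd.crosstab only to keep the example self-contained. The same query strings should work on pvt_df as long as missing vehicle counts are stored as 0.

```python
import pandas as pd

# Sample data copied from the question
df_veh = pd.DataFrame({
    'acc_index': ['001', '002', '002', '003', '003', '004', '005', '005', '006',
                  '006', '007', '007', '008', '008', '008', '009', '009', '009'],
    'veh_type': ['1', '1', '2', '1', '1', '1', '2', '2', '2', '3', '1', '2', '1', '1',
                 '1', '1', '2', '2']})
df_ped = pd.DataFrame({
    'acc_index': ['001', '002', '003', '004', '005', '006', '007', '008', '009'],
    'pedestrians': ['1', '2', '0', '1', '4', '3', '0', '1', '2']})

# One row per accident: vehicle counts per type plus pedestrian count
table = (pd.crosstab(df_veh['acc_index'], df_veh['veh_type'])
           .rename(columns={'1': 'bicycle', '2': 'car', '3': 'bus'})
           .join(df_ped.set_index('acc_index')['pedestrians'].astype(int), how='outer')
           .fillna(0))

# Query string for each category listed in the question
conditions = {
    'bikes and cars': '(bicycle >= 1) & (car >= 1)',
    'bikes and pedestrians': '(bicycle >= 1) & (pedestrians >= 1)',
    'cars and pedestrians': '(car >= 1) & (pedestrians >= 1)',
    'only cars': '(bicycle == 0) & (car >= 1) & (bus == 0) & (pedestrians == 0)',
    'only bicycles': '(bicycle >= 1) & (car == 0) & (bus == 0) & (pedestrians == 0)',
    'only pedestrians': '(bicycle == 0) & (car == 0) & (bus == 0) & (pedestrians >= 1)',
}

# Each row is a unique acc_index, so the row count is the accident count
counts = {label: len(table.query(expr)) for label, expr in conditions.items()}
print(counts)
```

Each accident appears once in the table, so no extra de-duplication of acc_index is needed before counting.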
