Is there a way to condense data into a single row of a Pandas DataFrame when rows share a value in one column?



I have a DataFrame with a few thousand rows. The DF holds unit identifiers and response times for units within my organization. It is structured with the columns ["Event#", "Unit", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"].

There are many different rows for the same Event#, and in the end I want only one row per Event#, with ["First UnitEnroute", "First UnitArrived", "First UnitAtHospital"] filled in from the other rows that share that Event#.

The reason for this is an end-of-quarter billing fiasco: we need to know whether these various events have all three of these times across the different units. I don't need the units listed, though, just the first non-0 value pulled from the other rows for the same event.

Here is some sample data:

Event#      Unit    First UnitEnroute           First UnitArrived           First UnitAtHospital
2020000394    37    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']
2020000394    38    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000394    36    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000394    39    ['1/1/2020', '10:45:34 PM'] ['1/1/2020', '10:48:33 PM'] ['1/1/2020', '11:45:01 PM']     
2020000617    58    ['1/2/2020', '12:06:13 PM'] ['1/2/2020', '12:07:39 PM'] ['1/2/2020', '12:43:10 PM']     
2020000849    74    ['1/2/2020', '6:42:19 PM']  ['1/2/2020', '6:53:53 PM']  ['1/2/2020', '7:28:32 PM']      
2020000849    75    ['0']                       ['0']                       ['0']
2020000927    81    ['0']                       ['0']                       ['0']
2020000927    80    ['0']                       ['0']                       ['0']
2020000997    86    ['0']                       ['0']                       ['0']
2020000997    87    ['0']                       ['0']                       ['0']
2020001218    99    ['1/3/2020', '11:50:39 AM'] ['1/3/2020', '11:52:40 AM'] ['1/3/2020', '12:29:37 PM']     
2020001218    98    ['0']                       ['1/3/2020', '11:52:40 AM'] ['0']
2020001255    102   ['1/3/2020', '12:44:30 PM'] ['0']                       ['0']
2020001255    103   ['1/3/2020', '12:40:19 PM'] ['0']                       ['0']
2020001258    98    ['1/3/2020', '12:49:00 PM'] ['1/3/2020', '12:57:22 PM'] ['1/3/2020', '1:39:03 PM']      
2020001258    103   ['0']                       ['0']                       ['0']
2020001258    104   ['0']                       ['0']                       ['0']
2020001258    105   ['0']                       ['0']                       ['0']

Here is what I have tried:

  1. Brute-forcing a loop over every row to find a non-0 value and then appending that value to the row.

    for row in DF:
        compare = list()
        for i in DF:
            if i[0] == row[0]:
                addition = list(i)
                compare = compare.append(addition)
                print("Compare: {}".format(compare))
                return compare
        for el in row.index:
            whatisit = row[el]
            if whatisit == 0:
                for item in compare.index:
                    if item[el] == 0:
                        return
                    else:
                        replacement = item[el]
                        print("Replacement: {}".format(replacement))
                        return replacement
                row[el] = replacement
    return DF
    
  2. Using pandas.groupby().fillna() to group rows with the same Event# and backfill over time. -- I did not get any acceptable results; most of the time it chopped my DF into some erratic shape. I may be misunderstanding how to use this (a working sketch of the idea follows the snippet below).

DF = DF.groupby("Event#")["Unit", "First UnitEnroute", "First UnitArrived", "First UnitAtHospital"].fillna(method="bfill")
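For reference, a minimal sketch of how the group-wise collapse could be done, assuming the "['0']" strings are first turned into real missing values so that groupby().first() can skip them (DF and the column names are taken from the question; note that this picks the first non-missing value per column in row order, not necessarily the earliest time):

import pandas as pd

placeholder = "['0']"                                # marker used in the data for "no time recorded"
collapsed = (DF.drop(columns='Unit')                 # the unit itself is not needed in the result
               .replace(placeholder, pd.NA)          # turn placeholders into missing values
               .groupby('Event#', as_index=False)
               .first())                             # first non-missing value per Event# and column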

Any direction is appreciated, and sorry if this has been posted before; I spent a lot of time looking for potential answers. I guess I have not yet developed the intuition to look at code and see how I could apply it to my project. I'm not a professional developer, I'm more of a hands-on, heavy-lifting kind of employee haha.

Here is one way to do it, IIUC (if I understand you correctly).

from io import StringIO
import pandas as pd
# create data frame
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')
# drop the column `Unit`
df = df.drop(columns='Unit')
# re-shape
df = df.melt(id_vars='Event#', var_name='first_unit', value_name='timestamp')
# drop timestamp == ['0']
mask = df['timestamp'].astype(str) != "['0']"
df = df[mask]
# drop duplicates
df = df.drop_duplicates()
# get min value for each group -- and re-shape
df = (df.groupby(['Event#', 'first_unit'])['timestamp'].min()
.unstack(level='first_unit')
.reset_index()
)
print(df)
first_unit      Event#            First UnitArrived  
0           2020000394  ['1/1/2020', '10:48:33 PM']   
1           2020000617  ['1/2/2020', '12:07:39 PM']   
2           2020000849   ['1/2/2020', '6:53:53 PM']   
3           2020001218  ['1/3/2020', '11:52:40 AM']   
4           2020001255                          NaN   
5           2020001258  ['1/3/2020', '12:57:22 PM']   
first_unit         First UnitAtHospital            First UnitEnroute  
0           ['1/1/2020', '11:45:01 PM']  ['1/1/2020', '10:45:34 PM']  
1           ['1/2/2020', '12:43:10 PM']  ['1/2/2020', '12:06:13 PM']  
2            ['1/2/2020', '7:28:32 PM']   ['1/2/2020', '6:42:19 PM']  
3           ['1/3/2020', '12:29:37 PM']  ['1/3/2020', '11:50:39 AM']  
4                                   NaN  ['1/3/2020', '12:40:19 PM']  
5            ['1/3/2020', '1:39:03 PM']  ['1/3/2020', '12:49:00 PM'] 

And here is the raw data (i.e., used to create the data frame):

data = '''Event#      Unit    First UnitEnroute           First UnitArrived           First UnitAtHospital
2020000394    37    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']
2020000394    38    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
2020000394    36    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
2020000394    39    ['1/1/2020', '10:45:34 PM']  ['1/1/2020', '10:48:33 PM']  ['1/1/2020', '11:45:01 PM']     
2020000617    58    ['1/2/2020', '12:06:13 PM']  ['1/2/2020', '12:07:39 PM']  ['1/2/2020', '12:43:10 PM']     
2020000849    74    ['1/2/2020', '6:42:19 PM']   ['1/2/2020', '6:53:53 PM']   ['1/2/2020', '7:28:32 PM']      
2020000849    75    ['0']                        ['0']                        ['0']
2020000927    81    ['0']                        ['0']                        ['0']
2020000927    80    ['0']                        ['0']                        ['0']
2020000997    86    ['0']                        ['0']                        ['0']
2020000997    87    ['0']                        ['0']                        ['0']
2020001218    99    ['1/3/2020', '11:50:39 AM']  ['1/3/2020', '11:52:40 AM']  ['1/3/2020', '12:29:37 PM']     
2020001218    98    ['0']                        ['1/3/2020', '11:52:40 AM']  ['0']
2020001255    102   ['1/3/2020', '12:44:30 PM']  ['0']                        ['0']
2020001255    103   ['1/3/2020', '12:40:19 PM']  ['0']                        ['0']
2020001258    98    ['1/3/2020', '12:49:00 PM']  ['1/3/2020', '12:57:22 PM']  ['1/3/2020', '1:39:03 PM']      
2020001258    103   ['0']                        ['0']                        ['0']
2020001258    104   ['0']                        ['0']                        ['0']
2020001258    105   ['0']                        ['0']                        ['0']
'''
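One caveat about taking min() on these values: the timestamps are strings, so min() compares them lexicographically rather than chronologically (it happens to work on this sample, but, for example, '10:45:34 PM' sorts before '9:15:00 AM'). Below is a minimal sketch of converting the "['date', 'time']" strings into real timestamps first, assuming df is the frame as freshly read from data above (before the melt) and that every cell is either such a pair or the "['0']" placeholder; the to_timestamp helper name is mine:

import ast
import pandas as pd

def to_timestamp(cell):
    # "['1/1/2020', '10:45:34 PM']" -> Timestamp; "['0']" -> NaT
    parts = ast.literal_eval(cell)            # parse the string into a Python list
    if parts == ['0']:                        # placeholder for "no time recorded"
        return pd.NaT
    return pd.to_datetime(' '.join(parts))    # e.g. "1/1/2020 10:45:34 PM"

time_cols = ['First UnitEnroute', 'First UnitArrived', 'First UnitAtHospital']
for col in time_cols:
    df[col] = df[col].map(to_timestamp)

# groupby().min() now compares real datetimes and skips NaT for missing times
first_times = df.groupby('Event#')[time_cols].min().reset_index()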

Since you have a few thousand rows, I would suggest processing each column separately and merging them back together:

# for each column, keep only rows with a real timestamp, then the first occurrence per Event#
df1 = df[df['First UnitEnroute'] != "['0']"][['Event#', 'First UnitEnroute']]
df1 = df1[~df1.duplicated(['Event#'])]
df2 = df[df['First UnitArrived'] != "['0']"][['Event#', 'First UnitArrived']]
df2 = df2[~df2.duplicated(['Event#'])]
df3 = df[df['First UnitAtHospital'] != "['0']"][['Event#', 'First UnitAtHospital']]
df3 = df3[~df3.duplicated(['Event#'])]
# left-merge so every event with an Enroute time appears once, with NaN where a time is missing
df_result = df1.merge(df2, on='Event#', how='left').merge(df3, on='Event#', how='left')

This way (if I have understood the problem correctly) you can also find events that are missing a timestamp for one or more of the first-unit stats, which in your sample is event 2020001255.
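For example, a small sketch of flagging those incomplete events from df_result (the name incomplete is mine; note that events with no Enroute time at all never enter df1, so they will not appear here):

# rows of df_result where at least one merged column came back empty after the left joins
incomplete = df_result[df_result[['First UnitArrived', 'First UnitAtHospital']].isna().any(axis=1)]
print(incomplete)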
