I have two dataframes:
labels:
import numpy as np
import pandas as pd
marker_labels = pd.DataFrame({'cohort_id':[1,1, 1], 'marker_type':['a', 'b', 'a'], 'start':['2020-01-2', '2020-01-04 05', '2020-01-06'], 'end':[np.nan, '2020-01-05 16', np.nan]})
marker_labels['start'] = pd.to_datetime(marker_labels['start'])
marker_labels['end'] = pd.to_datetime(marker_labels['end'])
marker_labels.loc[marker_labels['end'].isnull(), 'end'] = marker_labels.start + pd.Timedelta(days=1) - pd.Timedelta(seconds=1)
and data:
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'hour': {36: Timestamp('2020-01-04 04:00:00'), 37: Timestamp('2020-01-04 04:00:00'), 38: Timestamp('2020-01-04 04:00:00'), 39: Timestamp('2020-01-04 04:00:00'), 40: Timestamp('2020-01-04 04:00:00'), 41: Timestamp('2020-01-04 04:00:00'), 42: Timestamp('2020-01-04 04:00:00'), 43: Timestamp('2020-01-04 04:00:00'), 44: Timestamp('2020-01-04 04:00:00'), 45: Timestamp('2020-01-04 05:00:00'), 46: Timestamp('2020-01-04 05:00:00'), 47: Timestamp('2020-01-04 05:00:00'), 48: Timestamp('2020-01-04 05:00:00'), 49: Timestamp('2020-01-04 05:00:00'), 50: Timestamp('2020-01-04 05:00:00'), 51: Timestamp('2020-01-04 05:00:00'), 52: Timestamp('2020-01-04 05:00:00'), 53: Timestamp('2020-01-04 05:00:00')}, 'metrik_0': {36: -0.30098661551885625, 37: -0.6402837079024638, 38: -2.6953511655638778, 39: 0.4036062912674384, 40: -0.035627996627399204, 41: -0.06510225503176624, 42: -1.9745426914329782, 43: 1.4112111331287631, 44: 0.18641277342651516, 45: 0.10780795451690242, 46: 0.31822895003286417, 47: -1.0804164740649171, 48: -1.6676697601556636, 49: -1.0354359757914047, 50: 1.8570215568670299, 51: 0.9055795225472866, 52: -0.020539970820695173, 53: -0.7975048293123836}, 'cohort_id': {36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1, 50: 1, 51: 1, 52: 1, 53: 1}, 'device_id': {36: 6, 37: 5, 38: 11, 39: 20, 40: 18, 41: 1, 42: 14, 43: 9, 44: 12, 45: 9, 46: 14, 47: 11, 48: 20, 49: 5, 50: 1, 51: 12, 52: 6, 53: 18}})
df
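I want to perform a LEFT JOIN on the column cohort_id and on the time interval (hour BETWEEN start AND end).
Similar questions are:
- Merge two pandas dataframes by interval
- Merge pandas dataframes where one value is between two others
So far I have tried several approaches, but none of them is a complete solution:
First: slow, and the result is not fully output/accessible as plain pandas columns: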
def join_on_matching_interval(x):
    result = marker_labels[(marker_labels.cohort_id == x.cohort_id) & (x.hour >= marker_labels.start) & (x.hour <= marker_labels.end)]
    if len(result) == 0:
        result = []
    return result
df['marker_labels'] = df.apply(join_on_matching_interval, axis=1)
print(df.shape[0])
#df = df.explode('marker_labels') # this fails to work
df['size'] = df.marker_labels.apply(lambda x: len(x))
df[(df['size'] > 0)].head()
How can I access the results as columns?
Second: correct columns, but wrong number of rows (and fast):
Following the links I shared above:
print(len(df))
print(len(marker_labels))
merged_res = df.merge(marker_labels, left_on=['cohort_id'], right_on=['cohort_id'], how='left')
print(len(merged_res)) # the number of rows has increased
merged_res = merged_res[(merged_res.hour.between(merged_res.start,merged_res.end)) | (merged_res.start.isnull())]
print(len(merged_res)) # but now not enough rows are left over.
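- Case 1: no match (handled correctly)
- Case 2: full match (handled correctly)
- Case 3: partial match (not handled -> the records are dropped)
For case 3 in particular, this means:
- I do not want to get any duplicates
- all rows from the left side are kept
- and a match is attached wherever the timestamp falls inside an interval
How can I include the third case in the condition?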
You mean merge and query, then reindex back to the original rows:
tmp = (df.reset_index()
         .merge(marker_labels, on='cohort_id', how='left')
         .query('start <= hour <= end')
         .set_index('index')
         .reindex(df.index)
      )
out = tmp.combine_first(df)
Output:
cohort_id device_id end hour marker_type metrik_0 start
-- ----------- ----------- ------------------- ------------------- ------------- ---------- -------------------
36 1 6 NaT 2020-01-04 04:00:00 nan -0.300987 NaT
37 1 5 NaT 2020-01-04 04:00:00 nan -0.640284 NaT
38 1 11 NaT 2020-01-04 04:00:00 nan -2.69535 NaT
39 1 20 NaT 2020-01-04 04:00:00 nan 0.403606 NaT
40 1 18 NaT 2020-01-04 04:00:00 nan -0.035628 NaT
41 1 1 NaT 2020-01-04 04:00:00 nan -0.0651023 NaT
42 1 14 NaT 2020-01-04 04:00:00 nan -1.97454 NaT
43 1 9 NaT 2020-01-04 04:00:00 nan 1.41121 NaT
44 1 12 NaT 2020-01-04 04:00:00 nan 0.186413 NaT
45 1 9 2020-01-05 16:00:00 2020-01-04 05:00:00 b 0.107808 2020-01-04 05:00:00
46 1 14 2020-01-05 16:00:00 2020-01-04 05:00:00 b 0.318229 2020-01-04 05:00:00
47 1 11 2020-01-05 16:00:00 2020-01-04 05:00:00 b -1.08042 2020-01-04 05:00:00
48 1 20 2020-01-05 16:00:00 2020-01-04 05:00:00 b -1.66767 2020-01-04 05:00:00
49 1 5 2020-01-05 16:00:00 2020-01-04 05:00:00 b -1.03544 2020-01-04 05:00:00
50 1 1 2020-01-05 16:00:00 2020-01-04 05:00:00 b 1.85702 2020-01-04 05:00:00
51 1 12 2020-01-05 16:00:00 2020-01-04 05:00:00 b 0.90558 2020-01-04 05:00:00
52 1 6 2020-01-05 16:00:00 2020-01-04 05:00:00 b -0.02054 2020-01-04 05:00:00
53 1 18 2020-01-05 16:00:00 2020-01-04 05:00:00 b -0.797505 2020-01-04 05:00:00