使用panda计算群聊数据集中的三方对话数量



我想统计数据集中发生的三方对话的数量。聊天group_x可以由多个成员组成。

什么是三方对话?

  1. 第1路-red_x在组_x中发送消息
  2. 第二路-green_x在同一组_x中回复
  3. 第三路-red_x在同一组_x中发送回复

这可以称为三方对话。

序列必须完全是red_#,green_#,red_#。

什么是接触点?

  1. 触摸点1-red_x的第一条消息
  2. 触摸点2-green_x的第一条信息
  3. 触摸点3-red_x的第二条信息

用于轻松生成我正在使用的示例数据集的代码。

import pandas as pd
from pandas import Timestamp
t1_df = pd.DataFrame({'from_red': [True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True], 
'sent_time': [Timestamp('2021-05-01 06:26:00'), Timestamp('2021-05-04 10:35:00'), Timestamp('2021-05-07 12:16:00'), Timestamp('2021-05-07 12:16:00'), Timestamp('2021-05-09 13:39:00'), Timestamp('2021-05-11 10:02:00'), Timestamp('2021-05-12 13:10:00'), Timestamp('2021-05-12 13:10:00'), Timestamp('2021-05-13 09:46:00'), Timestamp('2021-05-13 22:30:00'), Timestamp('2021-05-14 14:14:00'), Timestamp('2021-05-14 17:08:00'), Timestamp('2021-06-01 09:22:00'), Timestamp('2021-06-01 21:26:00'), Timestamp('2021-06-03 20:19:00'), Timestamp('2021-06-03 20:19:00'), Timestamp('2021-06-09 07:24:00'), Timestamp('2021-05-01 06:44:00'), Timestamp('2021-05-01 08:01:00'), Timestamp('2021-05-01 08:09:00')], 
'w_uid': ['w_000001', 'w_112681', 'w_002516', 'w_002514', 'w_004073', 'w_005349', 'w_006803', 'w_006804', 'w_008454', 'w_009373', 'w_010063', 'w_010957', 'w_066840', 'w_071471', 'w_081446', 'w_081445', 'w_106472', 'w_000002', 'w_111906', 'w_000003'], 
'user_id': ['red_00001', 'green_0263', 'red_01071', 'red_01071', 'red_01552', 'red_01552', 'red_02282', 'red_02282', 'red_02600', 'red_02854', 'red_02854', 'red_02600', 'red_00001', 'red_09935', 'red_10592', 'red_10592', 'red_12292', 'red_00002', 'green_0001', 'red_00003'], 
'group_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 
'touchpoint': [1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 3, 3, 3, 1, 1, 3, 1, 1, 2, 1]}, 
columns = ['from_red', 'sent_time', 'w_uid', 'user_id', 'group_id', 'touchpoint'])
t1_df['sent_time'] = pd.to_datetime(t1_df['sent_time'], format = "%d-%m-%Y")
t1_df

数据集如下所示:

w_uid>触摸点[/tr>w_ 000001<1>绿色_0263w_002516<1>w_0025140103003<1>0133w_066840red_00001w_071471red_09935<1>w_081446红色_105921w_081445红色_105920w_106472<1><2>1
from_redsent_timeuser_idgroup_id
2021-05-01 06:26:00red_000010
错误2021-05-04 10:35:00w_11268102
2021-05-07 12:16:00red_010710
2021-05-07 12:16:00red_0107103
2021-05-09 13:39:00w_004073red_01552
2021-05-11 10:02:00w_005349red_01552
2021-05-12 13:10:00w_006803red_022821
2021-05-12 13:10:00w_006804red_02282
2021-05-13 09:46:00w_008454red_026000
2021-05-13 22:30:00w_009373red_02854
2021-05-14 14:14:00w_010063red_028540
2021-05-14 17:08:00w_010957red_026000
2021-06-01 09:22:0003
2021-06-01 21:26:000
2021-06-03 20:19:000
2021-06-03 20:19:003
2021-06-09 07:24:00red_1229201
2021-05-01 06:44:00w_ 000002red_000021
错误2021-05-01 08:01:00w_111906绿色_0001
2021-05-01 08:09:00w_ 000003red_000031

您可以使用.groupby一次对整个数据集进行操作。

# Get first occurence of sent_time for each group if touchpoint==2
groups = t1_df[t1_df['touchpoint']==2].groupby('group_id')['sent_time'].first()
# Reformat dataframe
groups = groups.reset_index().rename(columns={'sent_time':'first_time'})
# Add these times to dataframe by using 'group_id' as the lookup column
t1_df = t1_df.merge(groups)

# Apply remaining conditional flags
condition = t1_df['sent_time'] > t1_df['first_time']
condition &= t1_df['touchpoint'] == 3
test_df = t1_df[condition].drop('first_time', axis=1)
# Sort by groups, then by sent_times in each group
test_df = test_df.sort_values(['group_id','sent_time'])
# Cleanup
t1_df = t1_df.drop('first_time', axis=1)

基准

对于具有100万行的数据帧,此方法大约在0.52 seconds中运行(简单地重复原始数据帧并递增group_id(。

对我来说,还不清楚如何定义;三方对话";。在组内,如果你有input消息,你认为什么选项是";三方对话";?有几个选项:

Input  : red_0, red_2, green_0, red_1, red_0, red_2, red_1
Option1:        red_2, green_0, red_1
Option2: red_0,        green_0,        red_0
+   :        red_2, green_0,               red_2

以及更多。您的代码示例在绿色后发送时返回用户的第二条消息:

OptionX:               green_0,         red_0
+   :               green_0,               red_2
+   :               green_0,                      red_1

如果某个红色用户在绿色之前发送了一条消息,则不进行跟踪。另一个问题是,如果green在一个组内多次发送,会发生什么。

Input  : red_0, red_2, green_0, green_0, red_1, red_0, green_1, red_1

根据你的描述";序列必须恰好是red_#, green_#, red_#"我想,选项1就是你想要的,也许它甚至独立于颜色:color0_#, color1_#, color0_#。如果我错了,请纠正我;(。

准备DataFrame

为了使操作更通用,我首先准备DataFrame,例如提取用户的颜色,并获得颜色的整数表示

# extract the user color and id
t1_df[['color', 'id']] = t1_df.pop('user_id').str.split('_', expand=True)
# get the dtypes right, also it is not needed here
t1_df.id = t1_df.id.astype(int)
t1_df.color = t1_df.color.astype('category')
# get color as intager
t1_df['color_as_int'] =pd.factorize(t1_df.color)[0]

检测序列color0_#, color1_#, color0_#

# a three way conversation is where color_as_int is [...,a,b,a,...]
# expressed as difference it's color_as_int.diff() is [...,c,-c,...]
# get the difference with tracking the group, therefore first sort
t1_df.sort_values(['group_id', 'sent_time'], inplace=True)
d_color = t1_df.groupby(['group_id']).color_as_int.diff()
m = (d_color != 0) & (d_color == -d_color.shift(-1))  # detect [...,c,-c,...]
# count up for each three way conversation
m[m] = m[m].cumsum()
m = m.astype(int)
# get the labels for the dataframe [...,a,b,a,...]
t1_df['three_way_conversation'] = m + m.shift(1, fill_value=0) + m.shift(-1, fill_value=0)

它返回并适用于任何颜色的

columns = ['sent_time', 'group_id', 'color', 'id', 'touchpoint']
print(t1_df.loc[t1_df['three_way_conversation']>0, columns])
sent_time  group_id  color    id  touchpoint
0  2021-05-01 06:26:00         0    red     1           1
1  2021-05-04 10:35:00         0  green   263           2
2  2021-05-07 12:16:00         0    red  1071           1
17 2021-05-01 06:44:00         1    red     2           1
18 2021-05-01 08:01:00         1  green     1           2
19 2021-05-01 08:09:00         1    red     3           1

奖金

有了DataFrame准备,你可以很容易地计算一个组中每个颜色或用户的消息,或者从一个颜色或用户那里获得消息的第一次和最后一次。CCD_ 12比之后的CCD_ 13和CCD_。

t1_df['color_msg_count'] = t1_df.groupby(['group_id', 'color']).cumcount() + 1
t1_df['user_msg_count'] = t1_df.groupby(['group_id', 'color', 'id']).cumcount() + 1
t1_df['user_sent_time_min'] = t1_df.sort_values('sent_time').groupby(['group_id', 'color', 'id']).sent_time.cummin()
t1_df['user_sent_time_max'] = t1_df.sort_values('sent_time', ascending=False).groupby(['group_id', 'color', 'id']).sent_time.cummax()

最新更新