我想统计数据集中发生的三方对话的数量。聊天group_x可以由多个成员组成。

什么是三方对话？

第1路-red_x在组_x中发送消息
第二路-green_x在同一组_x中回复
第三路-red_x在同一组_x中发送回复

这可以称为三方对话。

序列必须完全是red_#，green_#，red_#。

什么是接触点？

触摸点1-red_x的第一条消息
触摸点2-green_x的第一条信息
触摸点3-red_x的第二条信息

用于轻松生成我正在使用的示例数据集的代码。

import pandas as pd
from pandas import Timestamp
t1_df = pd.DataFrame({'from_red': [True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True], 
'sent_time': [Timestamp('2021-05-01 06:26:00'), Timestamp('2021-05-04 10:35:00'), Timestamp('2021-05-07 12:16:00'), Timestamp('2021-05-07 12:16:00'), Timestamp('2021-05-09 13:39:00'), Timestamp('2021-05-11 10:02:00'), Timestamp('2021-05-12 13:10:00'), Timestamp('2021-05-12 13:10:00'), Timestamp('2021-05-13 09:46:00'), Timestamp('2021-05-13 22:30:00'), Timestamp('2021-05-14 14:14:00'), Timestamp('2021-05-14 17:08:00'), Timestamp('2021-06-01 09:22:00'), Timestamp('2021-06-01 21:26:00'), Timestamp('2021-06-03 20:19:00'), Timestamp('2021-06-03 20:19:00'), Timestamp('2021-06-09 07:24:00'), Timestamp('2021-05-01 06:44:00'), Timestamp('2021-05-01 08:01:00'), Timestamp('2021-05-01 08:09:00')], 
'w_uid': ['w_000001', 'w_112681', 'w_002516', 'w_002514', 'w_004073', 'w_005349', 'w_006803', 'w_006804', 'w_008454', 'w_009373', 'w_010063', 'w_010957', 'w_066840', 'w_071471', 'w_081446', 'w_081445', 'w_106472', 'w_000002', 'w_111906', 'w_000003'], 
'user_id': ['red_00001', 'green_0263', 'red_01071', 'red_01071', 'red_01552', 'red_01552', 'red_02282', 'red_02282', 'red_02600', 'red_02854', 'red_02854', 'red_02600', 'red_00001', 'red_09935', 'red_10592', 'red_10592', 'red_12292', 'red_00002', 'green_0001', 'red_00003'], 
'group_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 
'touchpoint': [1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 3, 3, 3, 1, 1, 3, 1, 1, 2, 1]}, 
columns = ['from_red', 'sent_time', 'w_uid', 'user_id', 'group_id', 'touchpoint'])
t1_df['sent_time'] = pd.to_datetime(t1_df['sent_time'], format = "%d-%m-%Y")
t1_df

数据集如下所示：

w_uid>触摸点[/tr>w_ 000001<1>绿色_0263w_002516<1>w_0025140103003<1>0133w_066840red_00001w_071471red_09935<1>w_081446红色_105921w_081445红色_105920w_106472<1><2>1

from_red	sent_time	user_id	group_id
真	2021-05-01 06:26:00	red_00001	0
错误	2021-05-04 10:35:00	w_112681	0	2
真	2021-05-07 12:16:00	red_01071	0
真	2021-05-07 12:16:00	red_01071	0	3
真	2021-05-09 13:39:00	w_004073	red_01552
真	2021-05-11 10:02:00	w_005349	red_01552
真	2021-05-12 13:10:00	w_006803	red_02282	1
真	2021-05-12 13:10:00	w_006804	red_02282
真	2021-05-13 09:46:00	w_008454	red_02600	0
真	2021-05-13 22:30:00	w_009373	red_02854
真	2021-05-14 14:14:00	w_010063	red_02854	0
真	2021-05-14 17:08:00	w_010957	red_02600	0
真	2021-06-01 09:22:00	0	3
真	2021-06-01 21:26:00	0
真	2021-06-03 20:19:00	0
真	2021-06-03 20:19:00	3
真	2021-06-09 07:24:00	red_12292	0	1
真	2021-05-01 06:44:00	w_ 000002	red_00002	1
错误	2021-05-01 08:01:00	w_111906	绿色_0001
真	2021-05-01 08:09:00	w_ 000003	red_00003	1

您可以使用.groupby一次对整个数据集进行操作。

# Get first occurence of sent_time for each group if touchpoint==2
groups = t1_df[t1_df['touchpoint']==2].groupby('group_id')['sent_time'].first()
# Reformat dataframe
groups = groups.reset_index().rename(columns={'sent_time':'first_time'})
# Add these times to dataframe by using 'group_id' as the lookup column
t1_df = t1_df.merge(groups)

# Apply remaining conditional flags
condition = t1_df['sent_time'] > t1_df['first_time']
condition &= t1_df['touchpoint'] == 3
test_df = t1_df[condition].drop('first_time', axis=1)
# Sort by groups, then by sent_times in each group
test_df = test_df.sort_values(['group_id','sent_time'])
# Cleanup
t1_df = t1_df.drop('first_time', axis=1)

基准

对于具有100万行的数据帧，此方法大约在0.52 seconds中运行(简单地重复原始数据帧并递增group_id(。

对我来说，还不清楚如何定义；三方对话"；。在组内，如果你有input消息，你认为什么选项是"；三方对话"；？有几个选项：

Input  : red_0, red_2, green_0, red_1, red_0, red_2, red_1
Option1:        red_2, green_0, red_1
Option2: red_0,        green_0,        red_0
+   :        red_2, green_0,               red_2

以及更多。您的代码示例在绿色后发送时返回用户的第二条消息：

OptionX:               green_0,         red_0
+   :               green_0,               red_2
+   :               green_0,                      red_1

如果某个红色用户在绿色之前发送了一条消息，则不进行跟踪。另一个问题是，如果green在一个组内多次发送，会发生什么。

Input  : red_0, red_2, green_0, green_0, red_1, red_0, green_1, red_1

根据你的描述"；序列必须恰好是red_#, green_#, red_#"我想，选项1就是你想要的，也许它甚至独立于颜色：color0_#, color1_#, color0_#。如果我错了，请纠正我；(。

准备DataFrame

为了使操作更通用，我首先准备DataFrame，例如提取用户的颜色，并获得颜色的整数表示

# extract the user color and id
t1_df[['color', 'id']] = t1_df.pop('user_id').str.split('_', expand=True)
# get the dtypes right, also it is not needed here
t1_df.id = t1_df.id.astype(int)
t1_df.color = t1_df.color.astype('category')
# get color as intager
t1_df['color_as_int'] =pd.factorize(t1_df.color)[0]

检测序列`color0_#, color1_#, color0_#`

# a three way conversation is where color_as_int is [...,a,b,a,...]
# expressed as difference it's color_as_int.diff() is [...,c,-c,...]
# get the difference with tracking the group, therefore first sort
t1_df.sort_values(['group_id', 'sent_time'], inplace=True)
d_color = t1_df.groupby(['group_id']).color_as_int.diff()
m = (d_color != 0) & (d_color == -d_color.shift(-1))  # detect [...,c,-c,...]
# count up for each three way conversation
m[m] = m[m].cumsum()
m = m.astype(int)
# get the labels for the dataframe [...,a,b,a,...]
t1_df['three_way_conversation'] = m + m.shift(1, fill_value=0) + m.shift(-1, fill_value=0)

它返回并适用于任何颜色的

columns = ['sent_time', 'group_id', 'color', 'id', 'touchpoint']
print(t1_df.loc[t1_df['three_way_conversation']>0, columns])
sent_time  group_id  color    id  touchpoint
0  2021-05-01 06:26:00         0    red     1           1
1  2021-05-04 10:35:00         0  green   263           2
2  2021-05-07 12:16:00         0    red  1071           1
17 2021-05-01 06:44:00         1    red     2           1
18 2021-05-01 08:01:00         1  green     1           2
19 2021-05-01 08:09:00         1    red     3           1

奖金

有了DataFrame准备，你可以很容易地计算一个组中每个颜色或用户的消息，或者从一个颜色或用户那里获得消息的第一次和最后一次。CCD_ 12比之后的CCD_ 13和CCD_。

t1_df['color_msg_count'] = t1_df.groupby(['group_id', 'color']).cumcount() + 1
t1_df['user_msg_count'] = t1_df.groupby(['group_id', 'color', 'id']).cumcount() + 1
t1_df['user_sent_time_min'] = t1_df.sort_values('sent_time').groupby(['group_id', 'color', 'id']).sent_time.cummin()
t1_df['user_sent_time_max'] = t1_df.sort_values('sent_time', ascending=False).groupby(['group_id', 'color', 'id']).sent_time.cummax()

使用panda计算群聊数据集中的三方对话数量

准备DataFrame

检测序列`color0_#, color1_#, color0_#`

奖金

相关内容

最新更新

热门标签：

使用panda计算群聊数据集中的三方对话数量

准备DataFrame

检测序列color0_#, color1_#, color0_#

奖金

相关内容

最新更新

热门标签：

检测序列`color0_#, color1_#, color0_#`