python group chat id



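I have a chat dataset, and I want to group it into conversations and count how many messages were sent in each one.

Here is my data. It is the chat log of "ID", whose name is Jimmy.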

Sender      Receiver   Text
ID          person1    HI
person1     ID         Hello~
ID          person1    My name is Jimmy
person1     ID         Nice to meet you!
ID          person1    Nice to meet you, too
ID          person2    Hi
person1     ID         Hi there
ID          person2    My name is Jimmy
person1     ID         My name is Abi
ID          person2    Nice to meet you
...         ....       .....
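
"ID" can chat with several people.
I want to count the number of messages in each conversation.
In this case, both conversations have 5 messages.

I have already written some code, but it is inefficient because my data is large.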

    # chat_df is the DataFrame of chat data
    df = []
    total_message = []
    receiver_id = chat_df["Receiver"].unique()
    for x in receiver_id:
        # rows where x is either the receiver or the sender
        conversation = chat_df[(chat_df["Receiver"] == x) | (chat_df["Sender"] == x)]
        total_message.append(len(conversation))
        df.append(conversation)

Is there a more efficient way to get the chat data for each pair of people?

I think you need stack with value_counts:

df1 = chat_df[['Sender','Receiver']].stack().value_counts().reset_index()
df1.columns = ['People','Counts']
print (df1)
    People  Counts
0       ID      10
1  person1       7
2  person2       3
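
If the goal is one count per conversation rather than per person, one possible sketch is to label each row with the partner who is not the log owner and group on that label. The `OWNER` constant and the `partner` Series are names made up for illustration, and the sample frame is rebuilt from the ten rows shown in the question.

```python
import pandas as pd

# Sample data rebuilt from the ten rows shown in the question.
chat_df = pd.DataFrame({
    "Sender": ["ID", "person1", "ID", "person1", "ID",
               "ID", "person1", "ID", "person1", "ID"],
    "Receiver": ["person1", "ID", "person1", "ID", "person1",
                 "person2", "ID", "person2", "ID", "person2"],
    "Text": ["HI", "Hello~", "My name is Jimmy", "Nice to meet you!",
             "Nice to meet you, too", "Hi", "Hi there",
             "My name is Jimmy", "My name is Abi", "Nice to meet you"],
})

OWNER = "ID"  # assumption: the account whose log this is

# Keep Receiver where the owner sent the message, otherwise take Sender,
# so every row is labelled with the other side of the conversation.
partner = chat_df["Receiver"].where(chat_df["Sender"] == OWNER, chat_df["Sender"])
partner = partner.rename("Partner")

# Number of messages per conversation, computed in one vectorized pass.
counts = partner.value_counts()

# One sub-DataFrame per conversation, without re-filtering chat_df in a loop.
conversations = {name: group for name, group in chat_df.groupby(partner)}

print(counts)
```

On the sample rows this gives 7 messages with person1 and 3 with person2, matching the per-person counts above.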

Edit:

#count the words in each message
chat_df['Len'] = chat_df.Text.str.split().str.len()
#reshape dataframe
chat_df = chat_df.set_index('Len')[['Sender','Receiver']].stack().reset_index(name='People')
print (chat_df)
    Len   level_1   People
0     1    Sender       ID
1     1  Receiver  person1
2     1    Sender  person1
3     1  Receiver       ID
4     4    Sender       ID
5     4  Receiver  person1
6     4    Sender  person1
7     4  Receiver       ID
8     5    Sender       ID
9     5  Receiver  person1
10    1    Sender       ID
11    1  Receiver  person2
12    2    Sender  person1
13    2  Receiver       ID
14    4    Sender       ID
15    4  Receiver  person2
16    4    Sender  person1
17    4  Receiver       ID
18    4    Sender       ID
19    4  Receiver  person2

#groupby by People and aggregate sum and size
chat_df1 = chat_df.groupby('People')['Len'].agg(['size','sum'])
chat_df1.columns = ['Count','Len_words']
chat_df1 = chat_df1.reset_index()
#keep only people who appear in more than 5 messages
chat_df1 = chat_df1[chat_df1['Count'] > 5]
print (chat_df1)
    People  Count  Len_words
0       ID     10         30
1  person1      7         21
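
The same reshape can also be sketched with `melt` and named aggregation, which avoids using `Len` as a temporary index. This is an alternative to the code above, not a replacement for it; it assumes pandas 0.25 or later for the named-aggregation syntax, and `long_df` and `summary` are illustrative names.

```python
import pandas as pd

# Sample data rebuilt from the ten rows shown in the question.
chat_df = pd.DataFrame({
    "Sender": ["ID", "person1", "ID", "person1", "ID",
               "ID", "person1", "ID", "person1", "ID"],
    "Receiver": ["person1", "ID", "person1", "ID", "person1",
                 "person2", "ID", "person2", "ID", "person2"],
    "Text": ["HI", "Hello~", "My name is Jimmy", "Nice to meet you!",
             "Nice to meet you, too", "Hi", "Hi there",
             "My name is Jimmy", "My name is Abi", "Nice to meet you"],
})

# Word count of each message (split on whitespace).
chat_df["Len"] = chat_df["Text"].str.split().str.len()

# Long format: one row per (message, participant), keeping the word count.
long_df = chat_df.melt(
    id_vars="Len",
    value_vars=["Sender", "Receiver"],
    value_name="People",
)

# Messages and total words per person.
summary = (
    long_df.groupby("People")
    .agg(Count=("Len", "size"), Len_words=("Len", "sum"))
    .reset_index()
)

# Keep only people who appear in more than 5 messages.
summary = summary[summary["Count"] > 5]
print(summary)
```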
