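pandas: count overlapping words between rows only when the values in another column match

I have a DataFrame that looks like the following, but with many more rows: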

import pandas as pd

data = {
    'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']],
}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])

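I have been using the code below, which comes from a Jaccard similarity calculation (not my own solution), to get the items shared by two documents: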

def lexical_overlap(doc1, doc2):
    # set() deduplicates the items of each document
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    # keep only the items present in both documents
    intersection = words_doc1.intersection(words_doc2)

    return intersection
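
One detail worth checking is what gets passed in: `set()` applied to a raw string yields its characters, not its words. The sketch below redefines `lexical_overlap` so that it runs on its own and contrasts the two cases. Splitting on whitespace is only an assumed tokenization for illustration, and the variable names are hypothetical.

```python
def lexical_overlap(doc1, doc2):
    """Return the set of items shared by doc1 and doc2."""
    return set(doc1).intersection(set(doc2))

# A raw string is broken into characters by set(), so this overlap is per character.
char_overlap = lexical_overlap("call me at 6", "she called me")

# A list of tokens gives a word-level overlap.
word_overlap = lexical_overlap(["call", "6"], ["call"])

# Assumption: a plain whitespace split is good enough to tokenize a sentence.
split_overlap = lexical_overlap("call me at 6".split(), "she called me".split())

print(word_overlap)   # {'call'}
print(split_overlap)  # {'me'}
```

If word-level overlap is the goal, passing the `key_words` lists (or pre-split sentences) rather than the `Sent` strings is probably what is wanted.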

I then modified the code given by @Amit Amola to compare the overlapping words between every pair of rows, and built a DataFrame from the result:

from itertools import combinations

overlapping_word_list = []
for val in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])

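Because my dataset is large, running this code to compare all rows takes a very long time. So I would like to compare only sentences that share the same intent and skip pairs whose intents differ. I am not sure how to go about doing only that.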
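
To put rough numbers on why restricting the comparison helps: comparing all rows costs n(n-1)/2 pairs, while comparing within each intent costs only the sum of the per-intent pair counts. A small sketch using the six sample rows above (the variable names are hypothetical):

```python
from collections import Counter
from math import comb

intents = ['order_food', 'order_food', 'order_taxi',
           'order_call', 'order_call', 'order_taxi']

# Every row compared with every other row.
all_pairs = comb(len(intents), 2)

# Only rows that share an intent are compared.
same_intent_pairs = sum(comb(n, 2) for n in Counter(intents).values())

print(all_pairs, same_intent_pairs)  # 15 3
```

The saving grows with the number of distinct intents, since pairs across different intents are never generated.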

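IIUC, you just need to iterate over the unique values in the intent column and then use loc to grab only the corresponding rows. If an intent has more than two rows, you will still need combinations to get the unique pairs within that intent.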

from itertools import combinations

for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent, ["Sent"]].Sent.to_list()
    combos = combinations(rows, 2)
    for combo in combos:
        # unpack the current pair (not the full list of rows)
        x, y = combo
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")
# Example output as posted with the answer:
#  Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
#  Overlap for (i need a cab) and (i would like a new taxi) is 40.0
#  Overlap for (call me at 6) and (she called me) is 54.54545454545454

OK, so based on @gold_cy's answer I worked out how to get the output I wanted, as mentioned in the comments:

for intent in df.intent.unique():
    # keep intent, key_words and Sent for each matching row, as plain lists
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        # unpack the current pair (not the full list of rows)
        x, y = combo
        # index 0 is intent, 1 is key_words, 2 is Sent
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
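
Since the question originally collected the results into a DataFrame, here is a minimal sketch of the same idea that stores one record per pair instead of printing. It swaps the `unique()`/`loc` loop for `DataFrame.groupby`, which is a different technique from the answer above, and the column names `sent_1`, `sent_2`, `overlap` and the variable `overlap_df` are hypothetical choices for illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "intent": ["order_food", "order_food", "order_taxi",
               "order_call", "order_call", "order_taxi"],
    "Sent": ["i need hamburger", "she wants sushi", "i need a cab",
             "call me at 6", "she called me", "i would like a new taxi"],
    "key_words": [["need", "hamburger"], ["want", "sushi"], ["need", "cab"],
                  ["call", "6"], ["call"], ["new", "taxi"]],
})


def lexical_overlap(doc1, doc2):
    """Return the set of items shared by doc1 and doc2."""
    return set(doc1) & set(doc2)


records = []
# sort=False keeps the intents in order of first appearance.
for intent, group in df.groupby("intent", sort=False):
    # iterrows() yields (index, row) tuples; pair up the rows within this intent only.
    for (_, x), (_, y) in combinations(group.iterrows(), 2):
        records.append({
            "intent": intent,
            "sent_1": x["Sent"],
            "sent_2": y["Sent"],
            # sorted() turns the set into a list with a stable order
            "overlap": sorted(lexical_overlap(x["key_words"], y["key_words"])),
        })

overlap_df = pd.DataFrame(records)
print(overlap_df)
```

On the sample data this produces three rows, one per intent, and only the `order_call` pair shares a keyword (`call`).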
