Splitting a chat conversation into sentences and mapping the responses



I have the following data:

Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)

I am trying to split it into a question-and-answer format, like this:

Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)

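This is a set of conversations, each with a unique ID. After splitting, I want each question and each answer in separate columns, with every response matched to the right question.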

I tried the following:

for i in d.split(':'):
    if i:
        print(i.strip().split('.'))

The output is:

['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']

You can use a simpler regular expression:

import re
# Match a speaker label: a word, optional whitespace, then a colon
p = re.compile(r'(\w*\s*:)')
input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# Put a newline in front of every speaker label
new_string = p.sub(r'\n\g<1>', input_string)
# Skip the empty string produced by the leading newline
for line in new_string.split('\n')[1:]:
    print(line)

Splitting on ':' is risky, because the conversation text itself may contain ':'.
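To make the risk concrete, here is a minimal sketch using a made-up message (the opening-hours text is hypothetical and not part of the original data). The colon inside the time is treated as a separator just like the ones after the speaker labels:

```python
# Hypothetical chat in which a message contains a colon of its own
chat = "Rep: we are open 9:30 to 5 Customer: ok"

# Naive split on every colon, as in the attempt from the question
parts = [p.strip() for p in chat.split(':')]
print(parts)
```

The time `9:30` ends up cut in half, so the pieces no longer line up with speakers and messages.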

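You should know the names of the rep and the customer up front, so that the regex pattern can look for those names followed by a colon. With re.findall you can parse the sample chat into: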

[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]

Then use a loop to map the items into whatever dict structure you prefer:

import re
from pprint import pprint

def parse_chat(chat, rep, customer):
    conversation = {}
    rep_message = ''
    # Capture (speaker, message); the lookahead ends each message at the next speaker label
    for person, message in re.findall(r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(rep, customer), chat):
        if person == rep:
            rep_message = message
        else:
            conversation[rep_message] = message
    return conversation

chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))

This outputs:

{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
"i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}

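The dict above maps each rep message to the customer reply that follows it, and the lookahead drops the final turn because no speaker label comes after it. As a rough sketch of the column layout the question asks for, the variant below keeps the last turn and pairs each customer message with the next rep message. The helper names `parse_turns` and `pair_question_answer` are made up for illustration, and treating every customer turn as a "question" is an assumption:

```python
import re

def parse_turns(chat, rep, customer):
    """Return (speaker, message) tuples, including the final turn."""
    speakers = '{0}|{1}'.format(re.escape(rep), re.escape(customer))
    # A message ends at the next speaker label, at "(Chat ended)", or at the end of the text
    pattern = r'({0}): (.*?)\s*(?=(?:{0}):|\(Chat ended\)|$)'.format(speakers)
    return re.findall(pattern, chat, flags=re.DOTALL)

def pair_question_answer(turns, customer):
    """Pair each customer message with the rep message that follows it."""
    rows = []
    question = None
    for speaker, message in turns:
        if speaker == customer:
            question = message
        elif question is not None:
            rows.append((question, message))
            question = None
    return rows

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. Customer: ok! "
        "thank you Rep: Is there anything else that I can help you with? (Chat ended)")

turns = parse_turns(chat, 'Rep', 'Customer')
rows = pair_question_answer(turns, 'Customer')
for question, answer in rows:
    print(question, '|', answer)
```

A list of tuples also avoids the key collisions a dict would have if the rep sends the same message twice.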
Solution

Assuming that only a single word with no whitespace in it precedes each colon, a straightforward approach is to use a regular expression to match the Customer or Rep string before the colon, then insert a newline to get the desired formatting.

Here is a working example:

import re
# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)
# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])
print(solution)
> "Rep: hi ! 
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. 
Customer: ok! thank you 
Rep: Is there anything else that I can help you with? 
(Chat ended)"

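Explanation

The newlines = re.sub... line first searches the data string for any whitespace-free word followed by a colon. It replaces each match with a \n character, followed by the matched run of non-whitespace characters, \S+ (which could be Customer, Rep, Bill, etc.), and then puts the : back at the end.

Finally, assuming every conversation ends with (Chat ended), the next line of code matches just that text and moves it onto a new line in the same way as the newlines = re.sub... line.

The output is a single string, but if you need something else you can split it on '\n' and carry on from there.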
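As one possible follow-up, the sketch below splits such a newline-separated string into (speaker, message) tuples. The shortened `solution` string is a stand-in for the real output, and skipping the "(Chat ended)" marker is an assumption about what you want to keep:

```python
# Shortened stand-in for the newline-separated output described above
solution = ("Rep: hi ! \n"
            "Customer: ok! thank you \n"
            "Rep: Is there anything else that I can help you with? \n"
            "(Chat ended)")

rows = []
for line in solution.split('\n'):
    line = line.strip()
    # Skip empty lines and the closing marker, which has no speaker
    if not line or line == '(Chat ended)':
        continue
    # partition splits on the first colon only, so later colons stay in the message
    speaker, _, message = line.partition(':')
    rows.append((speaker.strip(), message.strip()))

print(rows)
```

Using `partition` rather than `split(':')` keeps any colon inside the message text intact.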

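So you basically want to work out where to insert newlines. For that you can try a couple of different patterns. If the speakers are always "Customer" and "Rep":

(?<!^)(Customer:|Rep:|\(Chat ended\))

We simply check that we are not at the start of the string, then match the constant tokens by OR-ing them together. Or, more generally:

(?<=\s)([A-Z]\w+:|\(Chat ended\))

We look behind for a whitespace character (which also means we are not at the start of the string), then match either CapitalizedWord + COLON or the ending sequence, and insert a newline before each match.

The replacement for both:

\n$0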
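For reference, here is a small sketch of applying both patterns with Python's `re` module. The shortened `data` string and the variable names are illustrative. Note that Python writes the whole-match reference in a replacement as `\g<0>` rather than `$0`:

```python
import re

# Shortened sample in the same shape as the chat from the question
data = ("Rep: hi ! Customer: ok! thank you "
        "Rep: Is there anything else that I can help you with? (Chat ended)")

# Fixed speaker names, skipping a label at the very start of the string
fixed_labels = re.compile(r'(?<!^)(Customer:|Rep:|\(Chat ended\))')
# Any capitalized word followed by a colon, when preceded by whitespace
general_labels = re.compile(r'(?<=\s)([A-Z]\w+:|\(Chat ended\))')

# Insert a newline before every match; \g<0> stands for the whole match
results = [pattern.sub(r'\n\g<0>', data)
           for pattern in (fixed_labels, general_labels)]
print(results[0])
```

On this sample both patterns give the same result. The general one would also break on any other capitalized word that is directly followed by a colon inside a message.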