Splitting a chat conversation into sentences and mapping the responses

I have the following data:

Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)

I am trying to split it into a question-and-answer format, like this:

Rep: hi !
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options.
Customer: ok! thank you
Rep: Is there anything else that I can help you with?
(Chat ended)

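This is a set of conversations, each with a unique ID. After splitting, I would like each question and each answer in a separate column, with every response matched to the right question.

I tried the following: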

for i in d.split(':'):
    if i:
        print(i.strip().split('.'))

The output is:

['Rep']
['hi ! Customer']
['i was wondering if you have a delivery option? If so what are the options available ? Rep']
["i'd be happy to answer that for you! There is a 2 and 5 day delivery options", ' Customer']
['ok! thank you Rep']
['Is there anything else that I can help you with? (Chat ended)']

You can use a simpler regular expression!

import re
# Match a speaker label: word characters, optional whitespace, then a colon
p = re.compile(r'(\w*\s*:)')
input_string = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# Put a newline in front of every label
new_string = p.sub(r'\n\g<1>', input_string)
# Skip the first element, which is empty because the string starts with a label
for line in new_string.split('\n')[1:]:
    print(line)

Splitting on ':' is risky, because the conversation itself may contain ':'.
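
To make the risk concrete, here is a minimal sketch using a made-up chat line (not taken from the question's data) in which a message contains a time. The naive split cuts the time in two, while a pattern anchored on the known speaker names keeps it whole. The `|$` alternative in the lookahead is an assumption added here so that the final turn is also captured.

```python
import re

# Hypothetical chat line whose first message contains a colon ("10:30")
chat = "Rep: we open at 10:30 Customer: ok! thank you"

# Naive approach: splitting on ':' also breaks the time apart
pieces = [piece.strip() for piece in chat.split(':')]
print(pieces)

# Anchoring on the known speaker names leaves the message intact
turns = re.findall(r'(Rep|Customer): (.*?)\s*(?=(?:Rep|Customer):|$)', chat)
print(turns)
```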

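You should first know the names of the rep and the customer, so that you can search for those names followed by : in a regex pattern. With re.findall, the sample chat can be parsed into: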

[('Rep', 'hi !'), ('Customer', 'i was wondering if you have a delivery option? If so what are the options available ?'), ('Rep', "i'd be happy to answer that for you! There is a 2 and 5 day delivery options."), ('Customer', 'ok! thank you')]

Then use a loop to map the items into whatever dictionary structure you prefer:

import re
from pprint import pprint

def parse_chat(chat, rep, customer):
    conversation = {}
    rep_message = ''
    # Capture (speaker, message) pairs; each message ends where the next speaker label begins
    for person, message in re.findall(r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(rep, customer), chat):
        if person == rep:
            rep_message = message
        else:
            # Key each customer message by the rep message that preceded it
            conversation[rep_message] = message
    return conversation

chat = '''Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)'''
pprint(parse_chat(chat, 'Rep', 'Customer'))

This outputs:

{'hi !': 'i was wondering if you have a delivery option? If so what are the options available ?',
"i'd be happy to answer that for you! There is a 2 and 5 day delivery options.": 'ok! thank you'}

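Since the question asks for columns, the same loop can collect rows instead of a dictionary. This is only a sketch: the function name `chat_to_rows` and the column names `rep` and `customer` are made up for illustration, and the pattern is the same one used in `parse_chat` above, so a final turn with no following speaker label is still left out.

```python
import csv
import io
import re

def chat_to_rows(chat, rep, customer):
    """Pair each customer message with the rep message that preceded it."""
    pattern = r'({0}|{1}): (.*?)\s*(?=(?:{0}|{1}):)'.format(
        re.escape(rep), re.escape(customer))
    rows = []
    rep_message = ''
    for person, message in re.findall(pattern, chat):
        if person == rep:
            rep_message = message
        else:
            # Hypothetical column names, chosen for this example
            rows.append({'rep': rep_message, 'customer': message})
    return rows

chat = ("Rep: hi ! Customer: i was wondering if you have a delivery option? "
        "If so what are the options available ? Rep: i'd be happy to answer "
        "that for you! There is a 2 and 5 day delivery options. "
        "Customer: ok! thank you Rep: Is there anything else that I can help "
        "you with? (Chat ended)")
rows = chat_to_rows(chat, 'Rep', 'Customer')

# Write the rows as CSV text with one column per speaker
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['rep', 'customer'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A list of rows also avoids a limitation of the dictionary version: if the rep sends the same message twice, the second entry would overwrite the first under the same key.
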
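Solution

Assuming each speaker label is a single word with no whitespace in it, sitting directly before the colon, a straightforward approach is to use a regular expression to match the Customer and Rep strings before each colon and insert a newline in front of them to get the desired format.

Here is a working example: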

import re
# The data has been stored into a string by this point
data = "Rep: hi ! Customer: i was wondering if you have a delivery option? If so what are the options available ? Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. Customer: ok! thank you Rep: Is there anything else that I can help you with? (Chat ended)"
# First insert the newlines before the first word before a colon
newlines = re.sub(r'(\S+)\s*:', r'\n\g<1>:', data)
# Remove the first newline and fix the (Chat ended) on the end
solution = re.sub(r'\(Chat ended\)', '\n(Chat ended)', newlines[1:])
print(solution)
> "Rep: hi ! 
Customer: i was wondering if you have a delivery option? If so what are the options available ? 
Rep: i'd be happy to answer that for you! There is a 2 and 5 day delivery options. 
Customer: ok! thank you 
Rep: Is there anything else that I can help you with? 
(Chat ended)"

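Explanation

The newlines = re.sub... line first searches the data string for any run of non-whitespace characters followed by a colon. It replaces each match with a \n character, followed by the matched non-whitespace sequence \S+ (which could be Customer, Rep, Bill, etc.), and then puts the : back at the end.

Finally, assuming every conversation ends with (Chat ended), the next line of code matches just that text and moves it onto a new line in the same way as the newlines = re.sub... line.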

The output is a string, but if you need something else, you can split it on '\n' and do whatever processing you need from there.
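
As one possible follow-up, the sketch below splits such a string on '\n' and turns each line into a (speaker, message) tuple. The `solution` value here is a shortened stand-in for the real output, and skipping lines that have no colon (such as the closing marker) is a choice made for this example.

```python
# Shortened stand-in for the newline-separated string produced above
solution = ("Rep: hi ! \n"
            "Customer: ok! thank you \n"
            "(Chat ended)")

turns = []
for line in solution.split('\n'):
    # partition splits on the first ':' only, so later colons stay in the message
    speaker, sep, message = line.partition(':')
    if sep:  # lines without a colon, such as "(Chat ended)", are skipped
        turns.append((speaker.strip(), message.strip()))

print(turns)
```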

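So you basically want to work out where to insert newlines. There are a few different patterns you can try. If the labels are always "Customer" and "Rep":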

(?<!^)(Customer:|Rep:|\(Chat ended\))

We simply check that we are not at the start of the string, then match the constant labels by OR-ing them together. Or, more generally:

(?<=\s)([A-Z]\w+:|\(Chat ended\))

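We look behind for a whitespace character (which means we are not at the start of the string), then match either CapitalizedWord+COLON or the closing sequence, and insert a newline before each match.

The replacement for both: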

\n$0
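
The patterns and replacement above are written in a generic regex-tester style. As a sketch of how they might look in Python, where the whole match is referenced as \g<0> rather than $0, both patterns can be applied to a shortened version of the sample chat:

```python
import re

# Shortened version of the sample chat, used only for this illustration
data = ("Rep: hi ! Customer: ok! thank you "
        "Rep: Is there anything else that I can help you with? (Chat ended)")

# Fixed labels: match Customer:, Rep: or the closing marker, except at the very start
fixed_labels = re.compile(r'(?<!^)(Customer:|Rep:|\(Chat ended\))')
# General form: any capitalized word plus colon, or the closing marker, after whitespace
general = re.compile(r'(?<=\s)([A-Z]\w+:|\(Chat ended\))')

# Insert a newline before each match; \g<0> stands for the whole match
results = [pattern.sub(r'\n\g<0>', data) for pattern in (fixed_labels, general)]
for result in results:
    print(result)
    print('---')
```

The general pattern would also break the line before any other capitalized word followed by a colon inside a message, so the fixed-label version is the safer choice when the speaker names are known.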
