Regex over emails with changing conditions in Python



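I have employee-customer email exchanges and need to pull out the client message bodies for future sentiment analysis.

The emails were generated by different email applications, so there is no single regex rule I can use to separate them, and they do not all conform to the format the email module expects, so object-style parsing is not possible. Sometimes different email applications are mixed within one chain, so I cannot write a regex against one specific profile either.

However, these rules have proven reliable: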

Email starts:

  • *@[not acme] wrote:
  • [newline]to: *@acme.com
  • [newline]to: *'acme support'
  • start of chain

Email ends:

  • *@[acme]> wrote:
  • [newline]from: *@acme.com
  • end of chain

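These can be mixed and matched over the life of a chain. A message can begin with a "wrote" rule and end with a "[newline]from:" rule, a "*@acme> wrote" rule, and so on.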
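As a rough illustration, the rules above could be written as per-line regexes along these lines. The exact spellings are assumptions (the rules are only described informally here), and `START_RULES`, `END_RULES`, and `classify_line` are hypothetical names:

```python
import re

# Assumed regex spellings of the informal rules above; adjust to the real data.
START_RULES = [
    re.compile(r"\S+@(?!acme\.)\S+\s+wrote:"),  # *@[not acme] wrote:
    re.compile(r"^to:.*@acme\.com"),            # to: *@acme.com
    re.compile(r"^to:.*acme support"),          # to: *'acme support'
]
END_RULES = [
    re.compile(r"\S+@acme\.\S*>\s*wrote:"),     # *@[acme]> wrote:
    re.compile(r"^from:.*@acme\.com"),          # from: *@acme.com
]


def classify_line(line):
    """Return 'start', 'end', or None for a single line of an email chain."""
    if any(rule.search(line) for rule in START_RULES):
        return "start"
    if any(rule.search(line) for rule in END_RULES):
        return "end"
    return None
```

The "start of chain" and "end of chain" rules are not patterns at all, so they would have to be handled by whatever code walks the lines.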

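Is there an elegant way to give this regex varying start and end conditions? Ideally it would stop lazily the first time it hits an end rule.

FWIW, I consider myself strictly intermediate at Python: experienced enough to struggle meaningfully through the documentation, but not enough to use the deeper levels of the language.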

Sample source data:

thank you john

from: noreply@acme.com [mailto:noreply@acme.com] on behalf
of acme help
sent: thursday, december 29, 2016 11:28 am
to: Jane Doe
subject: re: aha - overtime



hi Jane,

it is affected by your payroll schedule. because it is semi-monthly,
overtime is a tricky thing to calculate, so we have to make sure we do
it just right! once i turn this setting on, you will be good to go from
this point on!

best regards,

john doe
customer experience team 

[]

<http://portal.mxlogic.com/images/transparent.gif> 

<http://portal.mxlogic.com/images/transparent.gif> 
ref:_00d15ft7b._50015ypl8b:
Jane_Doe@her_company.com wrote:
refit will not come up, even after logging on. i have my pass word and user
name write on a sheet of paper in my wallet, so i know it is correct. it
looks like it is trying to come up, but all i see is  two arrows going
in circles..

Jane

from: acme support [mailto:help@acme.com] 
sent: thursday, january 05, 2017 10:42 am
to: Jane Doe
subject: [graymail] re: happy new year from acme!

hello Jane,

sorry to hear that you're having trouble using acme. can you please
elaborate on the issue that you're experiencing?

best regards,

john
customer experience team 


--------------- original message ---------------
from: Jane Doe [jane_doe@her_company.com]
sent: 1/5/2017 11:34 am
to: support@acme.com
subject: re: [graymail] happy new year from amce!
our acme app is not working

Desired result (in any format; so far I have used re.findall() to store the results of earlier, simpler regexes in a list):

thank you john
refit will not come up, even after logging on. i have my pass word and user
name write on a sheet of paper in my wallet, so i know it is correct. it
looks like it is trying to come up, but all i see is  two arrows going
in circles..

Jane
our acme app is not working

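Edit:

I was able to parse chat logs earlier using code like this. The source data is currently stored in a pandas DataFrame made up only of client_id - log pairs. My current problem has the same structure, split into client_id - email_chain pairs: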

import re

for index, row in df_chttext.iterrows():  # for each client-chat item
    list_cleaned = []  # clear out old list_cleaned
    chat = row['chat_log']  # grab chat log
    list_visitor = re.findall('Visitor: .*?<br>', chat)  # get list of only visitor messages
    if list_visitor:  # if there is a list of client messages
        for message in list_visitor:  # scrub the message
            scrub = message.replace('Visitor: ', '')
            scrub = scrub.replace('<br>', '')
            scrub = scrub.replace('&#39;', "'")
            scrub = scrub.replace('&gt;', '>')
            scrub = scrub.replace('&lt;', '<')
            list_cleaned.append(scrub)
        df_chttext.at[index, 'chat_log'] = list_cleaned  # replace previous chat with scrubbed chat
    else:
        df_chttext.at[index, 'chat_log'] = ''  # if no user messages, then leave it empty
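As a side note, the chained replace() calls could probably be shortened with a capture group and the standard library's html.unescape(). The following is only a sketch; `extract_visitor_messages` is a hypothetical helper name, and it returns an empty list rather than an empty string when there are no visitor messages:

```python
import html
import re


def extract_visitor_messages(chat):
    """Return the visitor messages of one chat log with HTML entities decoded."""
    # The capture group keeps only the text between 'Visitor: ' and '<br>'.
    raw_messages = re.findall(r"Visitor: (.*?)<br>", chat)
    # html.unescape turns entities such as &#39;, &gt; and &lt; back into characters.
    return [html.unescape(message) for message in raw_messages]
```

Each row's chat_log could then be passed through this function in place of the inner scrubbing loop.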

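I suggest a line-by-line "capture state" approach*. You read the file one line at a time and decide whether each line belongs in the final output.

Consider the following: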

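  1. Read a line.
  2. Test whether the line matches an "email start" or "email end" pattern.
  3. If it matches a start pattern, turn the capture state on; if it matches an end pattern, turn the capture state off.
  4. Test whether the line matches something that is unwanted in the body but should not affect the capture state. If it does, skip the line.
  5. If the capture state is on and the line is not being skipped, add it to the overall body. Otherwise, do not add it, and move on to the next line.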

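Below is some Python (almost pseudo) code that implements this.

The script is incomplete (I don't want to write all the code for you), but it should give you a base to start from. (Perhaps begin by moving those for loops into a function, e.g. def matches_pattern_in_list(text, patterns).)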

import re

fname = "data.txt"

# Are we capturing data? The start of the chain counts as a start rule.
isCapturing = True

# Patterns that turn capturing state "on"
startPatterns = [
    # "<address outside acme> wrote:"
    re.compile(r'\S+@(?!acme\.)\S+\s+wrote:'),
    # .... more patterns here ....
]

# Patterns that will end the capturing state
endPatterns = [
    # "<address at acme>> wrote:"
    re.compile(r'\S+@acme\.\S*>\s*wrote:'),
]

# Patterns that don't affect capturing state,
# but should still be ignored (matched at the start of a line)
ignorePatterns = [
    re.compile(r'from|sent|subject'),
]

messageBodies = ""

with open(fname) as f:
    # Iterating over the file object yields one line per pass
    for line in f:
        skipThisLine = False
        for patt in startPatterns:
            if patt.search(line):
                isCapturing = True
                break
        for patt in endPatterns:
            if patt.search(line):
                isCapturing = False
                break
        for patt in ignorePatterns:
            if patt.match(line):
                skipThisLine = True
                break
        if isCapturing and not skipThisLine:
            messageBodies += line
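
The helper function suggested above could be sketched roughly as follows. This is one possible way to finish the script, not a definitive one: the pattern lists are assumptions based on the rules in the question, `extract_client_lines` is a hypothetical name, and unlike the script above it drops the marker lines themselves and any blank lines.

```python
import re

# Assumed patterns derived from the question's rules; extend as needed.
START_PATTERNS = [
    re.compile(r"\S+@(?!acme\.)\S+\s+wrote:"),
    re.compile(r"^to:.*@acme\.com"),
]
END_PATTERNS = [
    re.compile(r"\S+@acme\.\S*>\s*wrote:"),
    re.compile(r"^from:.*@acme\.com"),
]
IGNORE_PATTERNS = [
    re.compile(r"^(from|sent|to|subject):"),
    re.compile(r"^-+ original message -+"),
]


def matches_pattern_in_list(text, patterns):
    """Return True if any compiled pattern is found in text."""
    return any(patt.search(text) for patt in patterns)


def extract_client_lines(lines):
    """Collect the lines that fall inside client messages."""
    capturing = True  # the start of the chain counts as a start rule
    body = []
    for line in lines:
        if matches_pattern_in_list(line, START_PATTERNS):
            capturing = True
            continue  # the marker line itself is not part of the body
        if matches_pattern_in_list(line, END_PATTERNS):
            capturing = False
            continue
        if matches_pattern_in_list(line, IGNORE_PATTERNS):
            continue
        if capturing and line.strip():
            body.append(line.rstrip("\n"))
    return body
```

Start patterns are checked before the ignore patterns so that a header such as "to: support@acme.com" can switch capturing on before it is discarded as a header line.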

*: Yes, I did make that name up.
