How to extract email header fields from an mbox file using awk or grep



Regarding: mail messages in a mailbox (mbox format)

Multi Message File: Inbox.mbox

From - Thu Mar 26 16:16:21 2015
From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>
To: edge@notterribe.org
Subject: Mail delivery failed: returning message to sender
Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>
Date: Thu, 26 Mar 2015 02:21:17 -0700
Date: Thu, 26 Mar 2015 02:20:44 -0700
From: edge <edge@notterribe.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.5.0
MIME-Version: 1.0
To: leasing@theedgehenderson.com
CC: etpmgr@movein.net, t.simmonds@movein.ne
Subject: Fwd: Today's Breach Of Our Security.
From - Fri Mar 27 12:00:00 2015  

Pattern match order:

Date: Thu, 26 Mar 2015 02:21:17 -0700  
From - Thu Mar 26 16:16:21 2015  
From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>  
To: edge@notterribe.org  
Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>  
Subject: Mail delivery failed: returning message to sender 

Desired output:

Date: Thu, 26 Mar 2015 02:21:17 -0700;From - Thu Mar 26 16:16:21 2015;From: Mail Delivery System <Mailer-Daemon@200.netwizz.com>;To: edge@notterribe.org;Message-Id: <E1Yb3yX-0004CB-QH@200.netwizz.com>;Subject: Mail delivery failed: returning message to sender

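Goals:
* Each message in "Inbox.mbox" begins with "From "
* For "^Date: |^From |^From: |^To: |^Message-Id: |^Subject: ", match only the first occurrence of each pattern in a message and print that line
* Output the result in CSV format, delimited by semicolons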

I tried:

grep -a -E -i "^Date: |^From |^From: |^To: |^Message-ID: |^Subject: " Inbox.mbox
awk '/^Date: / || /^From / || /^From: / || /^To: / || /^Message-ID: / || /^Subject: /' Inbox.mbox

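Notes: the commands above give me a good start. I am most familiar with awk and grep, so I tried to use only those. I am having trouble printing the lines in the order I want and matching only the first occurrence of each pattern, where each match ends at a newline. Some messages contain binary data, which is why I used -a with grep.

Any help would be greatly appreciated.
Thank you.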

OK, so you only have a Thunderbird mbox.

Here is my idea, in a file named mbox2csv:

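#!/usr/bin/gawk -f
BEGIN {
    # initialize an empty array and set the "i" variable to 0
    i = split("", row, ":");
}
# awk does not have a built-in "join";
# "result" and "k" are listed as extra parameters so that they stay local
# and the loop does not overwrite the global "i"
function join(array, sep,    result, k) {
    sep = sep ? sep : ";";
    result = array[0];
    for (k = 1; k < length(array); ++k) {
        result = result sep array[k];
    }
    return result;
}
# the keys you want to store (the sample data spells it "Message-Id")
/^(From|Date|To|Message-I[Dd]|Subject):/ {
    row[i++] = $0;
}
# every time we match an mbox message separator
/^From / {
    # if there is data (i.e. this is not the first line)
    if (length(row) > 0) {
        print join(row);
        # reinitialise the array and "i"
        i = split("", row, ":");
    }
}
# the last message has no separator after it, so print it here
END {
    if (length(row) > 0) {
        print join(row);
    }
}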

Then: mbox2csv INBOX > result.csv
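
The script above prints every matching header in the order it appears in the file. To keep only the first occurrence of each header per message and emit the columns in the order the question asks for, something along these lines might work. It is only a sketch under a few assumptions: the names `mbox_first_headers` and `emit` are made up for illustration, header names are compared case-insensitively because the sample data spells `Message-Id`, and `LC_ALL=C` is set so that binary data in message bodies is treated as plain bytes.

```shell
#!/bin/sh
# Hypothetical helper: print one semicolon-separated row per mbox message,
# keeping only the first occurrence of each wanted header.
mbox_first_headers() {
    LC_ALL=C awk '
    # Print the collected fields in the order requested in the question.
    function emit(    k, line) {
        line = first[order[1]]
        for (k = 2; k <= n; k++) line = line ";" first[order[k]]
        print line
    }
    BEGIN {
        # "from_" stands for the "From " separator line itself.
        n = split("date,from_,from,to,message-id,subject", order, ",")
    }
    # "From " (with a space) starts a new message.
    /^From / {
        if (have) emit()
        split("", first)          # clear the fields of the previous message
        first["from_"] = $0
        have = 1
        next
    }
    # Header-like lines: remember only the first occurrence of each key.
    have && match($0, /^[A-Za-z-]+: /) {
        key = tolower(substr($0, 1, RLENGTH - 2))
        wanted = (key == "date" || key == "from" || key == "to" ||
                  key == "message-id" || key == "subject")
        if (wanted && !(key in first)) first[key] = $0
    }
    # The last message has no separator after it.
    END { if (have) emit() }
    ' "$@"
}
```

Usage would be similar to the script above: `mbox_first_headers Inbox.mbox > result.csv`. A header that is missing from a message leaves an empty column, so every row has the same number of fields.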

Big caveat: this does not account for line continuations, which are common in internet headers, nor for escaped lines.
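
One way to deal with the line continuations mentioned in the caveat is to unfold the headers before extracting them. The filter below is a rough sketch, not a full header parser: the names `unfold_headers` and `emit_held` are hypothetical, it assumes LF line endings, it treats everything between a "From " separator and the next empty line as the header block, and it collapses the leading whitespace of a continuation line into a single space.

```shell
#!/bin/sh
# Hypothetical filter: join folded header lines (lines starting with a space
# or tab inside a header block) onto the header line they continue.
unfold_headers() {
    LC_ALL=C awk '
    # Print the held header line, if any.
    function emit_held() {
        if (held != "") print held
        held = ""
    }
    # A "From " separator opens the header block of a new message.
    /^From / { emit_held(); in_headers = 1; print; next }
    # The first empty line ends the header block.
    in_headers && /^$/ { emit_held(); in_headers = 0; print; next }
    # Continuation line: append it to the held header, replacing the
    # leading whitespace with a single space.
    in_headers && /^[ \t]/ && held != "" {
        sub(/^[ \t]+/, "")
        held = held " " $0
        next
    }
    # Any other header line: hold it until we know it is complete.
    in_headers { emit_held(); held = $0; next }
    # Body lines pass through unchanged.
    { print }
    END { emit_held() }
    ' "$@"
}
```

The filter could then be placed in front of the extraction step, for example `unfold_headers Inbox.mbox | ./mbox2csv > result.csv`, since a gawk script with no file argument reads standard input.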

Edit: the code will be in a gist.
