Apache Pig: parsing a combined log with a regular expression



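I'm writing a Pig Latin script and trying to parse a log with a regular expression, but it fails when matching the double quote character ("). The error is: ERROR 1200: Unexpected character '"'. The log format is: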

118.102.255.50 - - [17/Oct/2014:00:00:29 -0400] "GET /favicon.ico HTTP/1.1" 200 20 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36"

And this is the script I wrote:

test = LOAD '/pigdata/log' as (line:chararray);
log = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^(\S+)\s+(\S+)\s+(\S+)\s+.(\S+\s+\S+).\s+"(\S+)\s+(.+?)\s+(HTTP[^"]+)"\s+(\S+)\s+(\S+)\s+"([^"]*)"\s+"(.*)"$')) AS (address_ip: chararray, logname: chararray, user: chararray, timestamp: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int, referer: chararray, userAgent: chararray);
dump log; 

Because Pig uses Java regex, you need to escape each " with a backslash (\"), like this:

log = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^(\S+)\s+(\S+)\s+(\S+)\s+.(\S+\s+\S+).\s+\"(\S+)\s+(.+?)\s+(HTTP[^"]+)\"\s+(\S+)\s+(\S+)\s+\"([^"]*)\"\s+\"(.*)\"$')) AS (address_ip: chararray, logname: chararray, user: chararray, timestamp: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int, referer: chararray, userAgent: chararray);
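
Since Pig hands the pattern to Java's regex engine, one way to separate regex problems from Pig quoting problems is to run the same pattern against the sample line with java.util.regex. The following is a minimal sketch, not part of the original script: the class name CombinedLogRegexCheck and the parse helper are hypothetical. In Java source every backslash is doubled and every double quote is escaped, so the compiled regex is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to exercise the pattern outside Pig.
class CombinedLogRegexCheck {
    // Same regex as in the question; "\\S" is the regex \S and "\"" is a literal quote.
    static final Pattern COMBINED = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+"
        + "\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+"
        + "\"([^\"]*)\"\\s+\"(.*)\"$");

    // Returns the 11 captured fields, or null if the whole line does not match.
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // capture groups are 1-based
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample line from the question.
        String line = "118.102.255.50 - - [17/Oct/2014:00:00:29 -0400] "
            + "\"GET /favicon.ico HTTP/1.1\" 200 20 \"-\" "
            + "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36\"";
        String[] fields = parse(line);
        if (fields == null) {
            System.out.println("no match");
            return;
        }
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

The eleven groups line up with the AS clause in the script: address_ip, logname, user, timestamp, method, uri, proto, status, bytes, referer, userAgent. If the pattern matches here but the Pig script still fails, the problem is more likely in how the string literal is quoted in Pig than in the regex itself.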
