How to load a complex web log syntax with Pig

I'm a Pig beginner. I have cdh4-pig installed and I'm connected to a cdh4 cluster. We need to process web log files that are going to be huge (the files are already loaded into HDFS). Unfortunately the log syntax is quite involved (not your typical comma-separated file). One constraint: I can't currently preprocess the log files with some other tool, because they are too large to store a copy of. Here is a raw line from the logs:

"2013-07-02 16:17:12-0700","?c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329 [FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhoneOS;FBSV/6.1.3;FBSS/2;FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn,nn.nn.0.20"

As you may have noticed, there is some JSON embedded in there, but it is URL-encoded. After URL-decoding (can Pig do URL-decoding?), here is what the JSON looks like:

{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}

I need to extract the various fields in the JSON, including the "things" field, which is actually a collection. I also need to extract the other query-string values in the log. Can Pig handle this kind of source data directly, and if so, can you point me toward how to get Pig to parse and load it?

Thanks!

For a task as complex as this, you usually have to write your own load function. I recommend chapter 11, "Writing Load and Store Functions", in Programming Pig. The load/store function coverage in the official documentation is far too simplistic.
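The core of such a loader is ordinary string parsing: in a real LoadFunc, getNext() would read one record and split the outer `"…","…"` structure into fields before wrapping them in a Tuple. Below is a minimal, self-contained sketch of just that splitting step in plain Java; the class and method names are hypothetical, and a real loader would extend org.apache.pig.LoadFunc rather than run standalone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineSplitter {
    // Split the outer "...","..." structure of one log line into fields.
    // Each field is a double-quoted run; quotes inside a field are assumed
    // to be escaped as \" (the Pig script below strips those escapes).
    public static List<String> splitFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"").matcher(line);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // "{...}" stands in for the long encoded payload
        String line = "\"2013-07-02 16:17:12-0700\",\"?c=Thing.Render&d={...}\",\"10.nn.nn.nnn\"";
        System.out.println(splitFields(line));
        // prints [2013-07-02 16:17:12-0700, ?c=Thing.Render&d={...}, 10.nn.nn.nnn]
    }
}
```

A LoadFunc built this way also has to declare a schema (or return untyped fields) and handle malformed rows without throwing, which is exactly where ad-hoc parsing tends to fail at scale.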

I experimented a lot and learned a lot. I tried several JSON libraries, piggybank, and java.net.URLDecoder; I even tried CSVExcelStorage. I registered the libraries and was able to partially solve the problem, but when I ran a test against a larger data set it started hitting encoding issues on certain rows of the source data, causing exceptions and failed jobs. So in the end I used Pig's built-in regex functionality to extract the desired values:

A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\"','"'); 
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');
-- extract each of the fields from the embedded json
A = foreach A GENERATE 
    REGEX_EXTRACT($0,'^.*"redirectId":"([^"\}]+).*$',1) as redirectId, 
    REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\}]+).*$',1) as fromUserId, 
    REGEX_EXTRACT($0,'^.*"userId":"([^"\}]+).*$',1) as userId, 
    REGEX_EXTRACT($0,'^.*"listId":"([^"\}]+).*$',1) as listId, 
    REGEX_EXTRACT($0,'^.*"c":"([^"\}]+).*$',1) as eventType,
    REGEX_EXTRACT($0,'^.*"renderSource":"([^"\}]+).*$',1) as renderSource,
    REGEX_EXTRACT($0,'^.*"renderType":"([^"\}]+).*$',1) as renderType,
    REGEX_EXTRACT($0,'^.*"engageType":"([^"\}]+).*$',1) as engageType,
    REGEX_EXTRACT($0,'^.*"clientTime":"([^"\}]+).*$',1) as clientTime,
    REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\}]+).*$',1) as clientTimeZone;

I decided against using REGEX_EXTRACT_ALL in case the ordering of the fields ever changes.
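The decode-then-extract logic above can be sanity-checked outside Pig with java.net.URLDecoder and java.util.regex. A standalone sketch (the string literal is a shortened fragment of the encoded JSON, not a full log line; note that URLDecoder handles all %XX escapes, whereas the Pig script only swaps %22):

```java
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        // Shortened url-encoded fragment like the one embedded in the log line
        String encoded = "%22redirectId%22:%22tgvm%22,%22clientTimeZone%22:5";
        // Decode %22 -> " (and any other %XX escapes)
        String decoded = URLDecoder.decode(encoded, "UTF-8");
        // Same pattern the REGEX_EXTRACT call uses for redirectId
        Pattern p = Pattern.compile("^.*\"redirectId\":\"([^\"}]+).*$");
        Matcher m = p.matcher(decoded);
        if (m.matches()) {
            System.out.println(m.group(1)); // prints: tgvm
        }
    }
}
```

Testing the patterns this way on a handful of problem rows is much faster than re-running a failing Pig job.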

Latest update