清管器消除彼此间隔1分钟内发生的重复事件



我们在cdh4集群中使用pig-0.110-cdh4.3.0,我们需要消除一些web日志的重复。解决方案的想法(用SQL表示)是这样的:

SELECT
     T1.browser,
     T1.click_type,
     T1.referrer,
     T1.datetime,
     T2.datetime
FROM
     My_Table T1
INNER JOIN My_Table T2 ON
     T2.browser = T1.browser AND
     T2.click_type = T1.click_type AND
     T2.referrrer = T1.referrer AND
     T2.datetime > T1.datetime AND
     T2.datetime <= DATEADD(mi, 1, T1.datetime)

我从这里获得了上面的SQL查找在1分钟内出现的重复记录。我希望我能在Pig中实现类似的解决方案,但我发现Pig显然不支持通过表达式(仅通过字段)进行上述连接所需的JOIN。你知道如何用Pig消除1分钟内的重复事件吗?谢谢

一种方法是,您可以通过所需的参数来完成类似于此组的操作

         top3 = foreach grpd {
            sorted = filter records by time < 60;
            top    = limit sorted 2;
            generate group, flatten(top);
         };

这将是的另一种方法

   records_group = group records by (browser, click_type, referrer);
   with_min = FOREACH records_group 
   GENERATE
   FLATTEN(records), MAX(records.datetime) as maxDt ;
  filterRecords = filter with_min by (maxDt - $2 ) <60;

$2是数据时间位置,相应地更改

在我的脑海中,这样的东西可以工作,但需要测试:

view = FOREACH input GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
    top = LIMIT view 1;
    GENERATE FLATTEN(view.(browser, click_type, referrer, datetime))
}

当然,如果一个事件发生在12:03:45,另一个事件在12:03:59,这些事件将在同一组中,12:04:45和12:05:00将在不同组中。

要获得确切的60秒差异,您需要编写一个UDF,它将迭代分组在(browser、click_type、referrer)上的已排序的包,并删除不需要的行。

Aleks and Marq,

  records_group = group records by (browser, click_type, referrer);
  with_min = FOREACH records_group 
           GENERATE FLATTEN(records), MAX(records.datetime) as max 
  with_min = FOREACH with_min GENERATE browser, click_type, referrer, 
            ABS(max - dateime) as maxDtgroup;
  regroup = group with_min by (browser, click_type, referrer, maxDtgroup);

使用maxDtGroup重新分组是关键,并过滤前1条记录。

相关内容

  • 没有找到相关文章

最新更新