我们在cdh4集群中使用pig-0.110-cdh4.3.0,我们需要消除一些web日志的重复。解决方案的想法(用SQL表示)是这样的:
SELECT
T1.browser,
T1.click_type,
T1.referrer,
T1.datetime,
T2.datetime
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.browser = T1.browser AND
T2.click_type = T1.click_type AND
T2.referrrer = T1.referrer AND
T2.datetime > T1.datetime AND
T2.datetime <= DATEADD(mi, 1, T1.datetime)
我从这里获得了上面的SQL查找在1分钟内出现的重复记录。我希望我能在Pig中实现类似的解决方案,但我发现Pig显然不支持通过表达式(仅通过字段)进行上述连接所需的JOIN。你知道如何用Pig消除1分钟内的重复事件吗?谢谢
一种方法是,您可以通过所需的参数来完成类似于此组的操作
top3 = foreach grpd {
sorted = filter records by time < 60;
top = limit sorted 2;
generate group, flatten(top);
};
这将是的另一种方法
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE
FLATTEN(records), MAX(records.datetime) as maxDt ;
filterRecords = filter with_min by (maxDt - $2 ) <60;
$2是数据时间位置,相应地更改
在我的脑海中,这样的东西可以工作,但需要测试:
view = FOREACH input GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
top = LIMIT view 1;
GENERATE FLATTEN(view.(browser, click_type, referrer, datetime))
}
当然,如果一个事件发生在12:03:45,另一个事件在12:03:59,这些事件将在同一组中,12:04:45和12:05:00将在不同组中。
要获得确切的60秒差异,您需要编写一个UDF,它将迭代分组在(browser、click_type、referrer)上的已排序的包,并删除不需要的行。
Aleks and Marq,
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE FLATTEN(records), MAX(records.datetime) as max
with_min = FOREACH with_min GENERATE browser, click_type, referrer,
ABS(max - dateime) as maxDtgroup;
regroup = group with_min by (browser, click_type, referrer, maxDtgroup);
使用maxDtGroup重新分组是关键,并过滤前1条记录。