Pig: Extract records that are not distinct in one column



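I want to extract the records whose value in one column is not distinct, i.e. it appears in more than one record. How can I do this?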

For example, given this input:

(user1, value1, value2)
(user1, value3, value4)
(user2, value5, value6)
(user3, value7, value8)
(user4, value9, value10)
(user4, value11, value12)

After extracting the records that have a duplicated value in column 1, the output should be:

(user1, value1, value2)
(user1, value3, value4)
(user4, value9, value10)
(user4, value11, value12)

Thanks in advance!

Let me know if this works for you. For testing purposes I declared value1 and value2 as chararray; in your actual code, change them to int or long as needed.

input.txt
user1,value1,value2
user1,value3,value4
user2,value5,value6
user3,value7,value8
user4,value9,value10
user4,value11,value12
PigScript
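-- Load the comma-separated file; switch value1/value2 to int or long for real data.
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray,value1:chararray,value2:chararray);
-- Group by user, then attach each group's size to every record in it.
B = GROUP A BY user;
C = FOREACH B GENERATE FLATTEN(A),COUNT(A) AS cnt;
-- Keep only records whose user appears more than once, then drop the count.
D = FILTER C BY cnt > 1;
E = FOREACH D GENERATE A::user,A::value1,A::value2;
DUMP E;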
Output:
(user1,value1,value2)
(user1,value3,value4)
(user4,value9,value10)
(user4,value11,value12)
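
If you prefer fewer steps, here is a sketch of a variant that filters the groups by size first and only then flattens the ones that survive. It assumes the same input.txt and schema as the script above; the alias names are arbitrary, and I have not run it against your real data.

```pig
-- Assumes the same input.txt and schema as the script above.
A = LOAD 'input.txt' USING PigStorage(',') AS (user:chararray,value1:chararray,value2:chararray);
B = GROUP A BY user;
-- Keep only the groups that contain more than one record.
C = FILTER B BY COUNT(A) > 1;
-- Expand the surviving bags back into individual records.
D = FOREACH C GENERATE FLATTEN(A);
DUMP D;
```

This avoids carrying the cnt column through the pipeline. Note that COUNT skips tuples whose first field is null, so COUNT_STAR may be the better choice if user can be null, and that DUMP does not guarantee the order of the output records.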
