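How do I write a Pig query that counts how many values are present in each field?
For example:
Field A | Field B
20|ABC;
21|XYZ;
25|null;
99|WER;
45|null;
89|FOY;
Required output: Field A count = 6, Field B count = 4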
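Pig does not treat the values in the input above as null. Each one is simply a chararray holding the text "null", so built-in operators such as is null and is not null do not help here. You need to group all the records, filter out the "null" strings, and then count. Can you try the script below?
Input: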
20|ABC;
21|XYZ;
25|null;
99|WER;
45|null;
89|FOY;
Pig script:
A = LOAD 'input' USING PigStorage('|') AS (f1:int,f2:chararray);
B = GROUP A ALL;
C = FOREACH B {
    -- drop rows whose second field is the literal text 'null;'
    filterNull = FILTER A BY (f2 != 'null;');
    GENERATE COUNT(A.f1) AS fieldA, COUNT(filterNull.f2) AS fieldB;
}
DUMP C;
Output:
(6,4)
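To sanity-check the expected counts without a cluster, the same group, filter and count logic can be modelled in plain Python. This is only a rough sketch of what the script computes, not how Pig executes it; the `count_present` helper, its `null_marker` parameter and the inline sample are assumptions made for illustration.

```python
# Rough Python model of the Pig script above (not Pig itself).
SAMPLE = """20|ABC;
21|XYZ;
25|null;
99|WER;
45|null;
89|FOY;"""


def count_present(text, null_marker="null;"):
    """Return (fieldA count, fieldB count) for pipe-delimited rows.

    count_present and null_marker are hypothetical names for this sketch.
    """
    rows = [line.split("|") for line in text.splitlines() if line]
    # Models COUNT(A.f1): count rows whose first field is non-empty.
    field_a = sum(1 for row in rows if row[0] != "")
    # Models FILTER A BY f2 != 'null;' followed by COUNT on the result.
    field_b = sum(1 for row in rows if len(row) > 1 and row[1] != null_marker)
    return field_a, field_b


print(count_present(SAMPLE))  # (6, 4)
```

The marker is `null;` rather than `null` because each input line ends with a semicolon, which becomes part of the second field.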
Here are the steps to follow to get the output:
fieldcount = load '/user/examples/stackoverflow/count.txt' using PigStorage('|') as (a:int, b:chararray);
-- strip the trailing ';' so the placeholder becomes the plain text 'null'
fieldcount1 = FOREACH fieldcount GENERATE a, REPLACE(b,';','') as b;
fieldcount2 = GROUP fieldcount1 ALL;
fieldcount3 = FOREACH fieldcount2 {
    a_cnt = FILTER fieldcount1 BY a is not null;
    -- exclude both real nulls and the literal text 'null'
    b_cnt = FILTER fieldcount1 BY b is not null and b != 'null';
    GENERATE COUNT(a_cnt) as a_count, COUNT(b_cnt) as b_count;
}
DUMP fieldcount3;
Here is another answer. My sample data is:
003 Amit Delhi India 12000
004 Anil Delhi India 15000
005 Deepak Delhi India 34000
006 Fahed Agra India 45000
007 Ravi Patna India 98777
008 Avinash Punjab India 120000
009 Saajan Punjab India 54000
001 Harit Delhi India 20000
002 Hardy Agra India 20000
011 Banglore
The fields are all separated by spaces.
The code is as follows:
A = load '/edata' using PigStorage(' ') as (eid:int,name:chararray,city:chararray,country:chararray,salary:int);
s = group A ALL ;
result = foreach s generate COUNT(A.eid),COUNT(A.name),COUNT(A.country),COUNT(A.salary);
dump result ;
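You will get the following result. COUNT skips null values, so the incomplete row for 011 is not counted in the columns it leaves empty: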
(10,9,9,9)
Input:
20|ABC
21|XYZ
25|null
99|WER
45|null
89|FOY
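Script:
inputData = LOAD 'input' using PigStorage('|');
grouped_input = GROUP inputData ALL;
-- after GROUP ALL, $0 is the group key and $1 is the bag of rows, so project each column out of the bag
counts = FOREACH grouped_input GENERATE COUNT(inputData.$0), COUNT(inputData.$1);
-- COUNT skips only real nulls (such as empty fields); the literal text 'null' is still counted
dump counts;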
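A caveat when counting this way: PigStorage loads an empty field as a Pig null, but the four-character text `null` is an ordinary value, so COUNT still includes it. The Python sketch below models that distinction. The `load` and `count_columns` helpers are hypothetical names, and the model is a simplification of Pig's behaviour rather than a description of its internals.

```python
def load(text, delimiter="|"):
    """Split each line into fields, mapping empty fields to None."""
    return [
        [field if field != "" else None for field in line.split(delimiter)]
        for line in text.splitlines()
        if line
    ]


def count_columns(rows, width):
    """Count the non-None values in each of the first `width` columns."""
    return tuple(
        sum(1 for row in rows if len(row) > i and row[i] is not None)
        for i in range(width)
    )


# Same sample twice: once with the text 'null', once with empty fields.
literal = load("20|ABC\n21|XYZ\n25|null\n99|WER\n45|null\n89|FOY")
empty = load("20|ABC\n21|XYZ\n25|\n99|WER\n45|\n89|FOY")

print(count_columns(literal, 2))  # (6, 6): the text 'null' is counted
print(count_columns(empty, 2))    # (6, 4): empty fields are skipped
```

If the source file really contains the word `null`, an explicit filter on that string is still needed to arrive at 4 for the second field.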