pig join and average



I am trying to learn Pig on my own, and I have the following script:

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int, customer_id:int, rating:int); 
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
item_join = join item_data by item_id, customer_ratings by i_id;
item_group = GROUP item_join ALL;
item_foreach = foreach item_group generate item_id, item_name, item_url,  AVG(item_join.rating);
PRINT = limit item_foreach 40;
dump PRINT;

The foreach fails with the following error:

  Invalid field projection. Projected field [item_id] does not exist in schema: group:chararray,item_join:bag{:tuple(item_data::item_id:int,item_data::item_name:chararray,item_data::dummy:int,item_data::item_url:chararray,customer_ratings::i_id:int,customer_ratings::customer_id:int,customer_ratings::rating:int)}.

I know I am missing something from the tutorials to make this work... Any idea how to print the fields in the foreach?

I also tried generate item_data::item_id, item_data::item_name, etc., as described in (pig-how to reference columns in a FOREACH after a JOIN?), but that did not work either...

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int,customer_id:int, rating:int); 
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
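item_join = foreach (
             join item_data by item_id, 
             customer_ratings by i_id
             )
            generate 
             -- give the joined fields short names so later steps need no relation:: prefix
             item_data::item_id as item_id, 
             item_data::item_name as item_name,
             item_data::item_url as item_url,
             customer_ratings::rating as rating
            ;
-- group by the item fields so that each item gets its own average
item_group = GROUP item_join by (item_id, item_name, item_url);
item_foreach = foreach item_group generate 
                -- FLATTEN turns the group tuple back into top-level fields
                FLATTEN(group) as (item_id, item_name, item_url), 
                AVG(item_join.rating) as avg_rating
               ;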
PRINT = limit item_foreach 40;
dump PRINT;

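I think something like this should work, although I have not tested it. I did two things. First, right after the join I gave the fields simple names, so we do not have to carry around a bunch of fields named relation::fieldname.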

Second, flattening the group is an easier way to get the keys out of the group. In your original example, I think you would need to use something like

generate item_join.item_data::item_id 
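
To make that concrete, here is a minimal, untested sketch that keeps the join from the question unchanged and groups by the item fields instead of using GROUP ... ALL. With GROUP ... ALL the joined fields exist only inside the item_join bag (as the schema in the error message shows), so projecting item_join.item_data::item_id there would return a bag of every id rather than one row per item. The aliases item_avg, first_40 and avg_rating are names made up for this illustration; DESCRIBE prints the schema of a relation, which helps to check which field names are available at each step.

```pig
-- Same inputs as in the question
customer_ratings = LOAD 'customer_ratings.txt' AS (i_id:int, customer_id:int, rating:int);
item_data = LOAD 'item_data.txt' USING PigStorage(',') AS (item_id:int, item_name:chararray, dummy:int, item_url:chararray);

-- After the join, fields carry a relation:: prefix
item_join = JOIN item_data BY item_id, customer_ratings BY i_id;
DESCRIBE item_join;

-- Group by the item fields; every group then holds the ratings of one item
item_group = GROUP item_join BY (item_data::item_id, item_data::item_name, item_data::item_url);
DESCRIBE item_group;

-- FLATTEN(group) exposes the key fields; AVG runs over the ratings inside the bag
item_avg = FOREACH item_group GENERATE
    FLATTEN(group) AS (item_id, item_name, item_url),
    AVG(item_join.customer_ratings::rating) AS avg_rating;

first_40 = LIMIT item_avg 40;
DUMP first_40;
```

Running the two DESCRIBE statements before the FOREACH is a quick way to see why a bare item_id cannot be projected from the grouped relation: after GROUP, the only top-level fields are group and the item_join bag.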
