Pig ERROR 0: Scalar has more than one row in the output



I have two files, and I am trying to join them based on pattern matching.

File1 :
weather.bbc.co.uk,112 
ads.facebook.com,113 
ads.amazon.co.uk,114 
www.sky.com,115 
news.bbc.co.uk,116 
pics.facebook.com,117
File2 :
facebook.com,facebook 
bbc.co.uk,bbc 
netflix.com,netflix 
flipkart.com,flipkart
Expected output:
weather.bbc.co.uk,112,bbc.co.uk,bbc
ads.facebook.com,113,facebook.com,facebook
news.bbc.co.uk,116,bbc.co.uk,bbc
pics.facebook.com,117,facebook.com,facebook 
Script
file1 = LOAD '/file1' using PigStorage('|') as (request_domain: chararray,msisdn:int);       
file2 = LOAD '/file2' using PigStorage('|') as (domain: chararray,provider: chararray);
file3 = JOIN file1 by case when (request_domain MATCHES CONCAT(CONCAT('(?i).*',file2.domain),'.*')) then file2.domain  else 'Other' end LEFT OUTER,file2 by domain;
DESCRIBE file3;            
dump file3;

But I am getting the following error:

WARN [Thread-29] org.apache.hadoop.mapred.LocalJobRunner - job_local_0006
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (facebook.com,facebook), 2nd :(bbc.co.uk,bbc)
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:317)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:221)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:275)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:317)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:221)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:275)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextString(POUserFunc.java:432)

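First, the delimiter should be ',' rather than '|', since both files are comma-separated: PigStorage(',')

Second, referencing file2.domain inside the JOIN expression makes Pig treat file2 as a scalar, which only works when the relation has exactly one row. file2 has several rows, so the pattern would match against multiple values, hence the error. Try the following instead.

Using CROSS with the INDEXOF function: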
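-- Load both files with ',' as the field delimiter.
file1 = LOAD 'data/file1.txt' using PigStorage(',') as (request_domain: chararray,msisdn:int);
file2 = LOAD 'data/file2.txt' using PigStorage(',') as (domain: chararray,provider: chararray);
-- Pair every row of file1 with every row of file2.
crossed = CROSS file1,file2;
-- Keep only the pairs where domain occurs inside request_domain.
filtered = FILTER crossed BY INDEXOF(file1::request_domain,file2::domain) != -1 ;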
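
To see what the CROSS plus INDEXOF filter computes, here is a rough Python sketch of the same logic run on the sample rows from the question. It is only an illustration, not Pig code: the function name cross_and_filter is made up for this example, the rows are typed in without the trailing spaces shown in the listings, and Python's str.find stands in for INDEXOF (both return -1 when the substring is absent).

```python
# Sample rows copied from File1 and File2 in the question.
file1 = [
    ("weather.bbc.co.uk", 112),
    ("ads.facebook.com", 113),
    ("ads.amazon.co.uk", 114),
    ("www.sky.com", 115),
    ("news.bbc.co.uk", 116),
    ("pics.facebook.com", 117),
]
file2 = [
    ("facebook.com", "facebook"),
    ("bbc.co.uk", "bbc"),
    ("netflix.com", "netflix"),
    ("flipkart.com", "flipkart"),
]


def cross_and_filter(left, right):
    """Hypothetical helper: pair every left row with every right row
    (like CROSS), then keep the pairs where domain occurs inside
    request_domain (like INDEXOF(...) != -1)."""
    return [
        (request_domain, msisdn, domain, provider)
        for request_domain, msisdn in left
        for domain, provider in right
        if request_domain.find(domain) != -1
    ]


filtered = cross_and_filter(file1, file2)
for row in filtered:
    print(",".join(str(field) for field in row))
```

On this sample the sketch keeps the four rows listed as the expected output. Note that the substring test is case-sensitive, unlike the (?i) regex in the original script, and that CROSS compares every row of one file with every row of the other, so it can get expensive on large inputs. Pig also does not guarantee the row order that the Python list happens to preserve.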
