Hive: data format changes when converting from ORC to TEXT



I have a Hive table with the following schema:

CREATE EXTERNAL TABLE db_test.user_arry(
cstid string, 
prdctsslctd array<string>, 
indvprc array<bigint>, 
dscntamt array<bigint>, 
prdctsrjctd array<string>)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/location/on/a/hadoop/'

The data in it is in the following format:

--------------------------------------------------------  
name | prdctsslctd | indvprc | dscntamt | prdctsrjctd 
--------------------------------------------------------   
cctg65  ["m_jns","cbyht"]        ["23","6"]       ["1","1"] ["shs","jkt"]
jju89o0 ["top","jeans_wmn"]      ["55","45"]      [NULL]         [NULL]
ju34hd  ["laychps","candy","toy"]["3","5","67"]["12","8"]["candy"]

I am trying to load this data into a table in which every column has the string data type:

CREATE EXTERNAL TABLE db_test.user_strng(
cstid string, 
prdctsslctd string, 
indvprc string, 
dscntamt string, 
prdctsrjctd string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION
'/location/on/a/hadoop/';

using:

insert into db_test.user_strng select * from db_test.user_arry;

Actual output:

--------------------------------------------------------  
name | prdctsslctd | indvprc | dscntamt | prdctsrjctd 
--------------------------------------------------------   
cctg65  m_jnscbyht        236       11     shsjkt
jju89o0 topjeans_wmn      5545      NULL   NULL
ju34hd  laychpscandytoy   3567      128    candy

Expected output:

--------------------------------------------------------  
name | prdctsslctd | indvprc | dscntamt | prdctsrjctd 
--------------------------------------------------------   
cctg65  "m_jns","cbyht"          "23","6"         "1","1"   "shs","jkt"
jju89o0 "top","jeans_wmn"        "55","45"         NULL           NULL
ju34hd  "laychps","candy","toy"  "3","5","67"     "12","8"  "candy"

I am not sure what is going wrong, or what I am missing.

Update_1

Output of the table after converting array<bigint> to array<string>:

ALTER TABLE user_arry CHANGE indvprc indvprc array<string>;
ALTER TABLE user_arry CHANGE dscntamt dscntamt array<string>;

--------------------------------------------------------  
name | prdctsslctd | indvprc | dscntamt | prdctsrjctd 
--------------------------------------------------------   
cctg65  ["m_jns","cbyht"]        ["23","6"]       ["1","1"] ["shs","jkt"]
jju89o0 ["top","jeans_wmn"]      ["55","45"]      []         []
ju34hd  ["laychps","candy","toy"]["3","5","67"]["12","8"]["candy"]

Final output in the table where every column is a string:

--------------------------------------------------------  
name | prdctsslctd | indvprc | dscntamt | prdctsrjctd 
--------------------------------------------------------   
cctg65  m_jns    cbyht      23   6    1      1      shs  jkt
jju89o0 top      jeans_wmn  55   45       
ju34hd  laychps  candy      toy  3    5      67     12    8    candy

I am still not getting the desired output.

Update_2

As suggested, changing FIELDS TERMINATED BY ',' to FIELDS TERMINATED BY '\t' produced the data in the desired format.
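For reference, the changed target table might look like the sketch below. It repeats the user_strng DDL from the question and differs only in the field delimiter; the location is the placeholder path used in the question. The change matters because, with ',' as the field delimiter, the commas inside a joined value are read back as column boundaries, which matches the spilled columns seen in Update_1.

```sql
-- Sketch only: same columns as db_test.user_strng in the question,
-- with a tab instead of a comma as the field delimiter.
CREATE EXTERNAL TABLE db_test.user_strng(
  cstid       string,
  prdctsslctd string,
  indvprc     string,
  dscntamt    string,
  prdctsrjctd string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   -- commas inside values no longer split columns
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/location/on/a/hadoop/';  -- placeholder path from the question
```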

Change all array types to array<string>:

ALTER TABLE user_arry CHANGE indvprc indvprc array<string>;
ALTER TABLE user_arry CHANGE dscntamt dscntamt array<string>;

Judging by your data sample, array<bigint> is not the right type for values like "23.45"; array<string> should work for your data files.

Use concat_ws to convert the arrays to comma-separated strings:

insert into db_test.user_strng 
select  cstid, 
concat_ws(',',prdctsslctd)  as prdctsslctd,
concat_ws(',',indvprc)      as indvprc,
concat_ws(',',dscntamt)     as dscntamt,
concat_ws(',',prdctsrjctd) as prdctsrjctd 
from db_test.user_arry;
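Note that concat_ws(',', ...) joins the elements without the double quotes shown in the expected output. If the quotes are wanted, one possible variation is to join with '","' and wrap the result in quotes. This is only a sketch: it assumes all four array columns are already array<string>, that the target table uses a field delimiter other than a comma (for example a tab), and that empty or NULL arrays should become NULL, which is a choice made here rather than something stated in the question. The first column is taken as cstid, the name given in the table DDL.

```sql
-- Sketch only. Assumes all four array columns are array<string> and that
-- db_test.user_strng uses a non-comma field delimiter (e.g. '\t').
INSERT INTO db_test.user_strng
SELECT
  cstid,
  -- Joining with '","' and wrapping in quotes turns [a, b] into "a","b".
  -- Empty or NULL arrays fall through the CASE and become NULL.
  CASE WHEN size(prdctsslctd) > 0
       THEN concat('"', concat_ws('","', prdctsslctd), '"') END AS prdctsslctd,
  CASE WHEN size(indvprc) > 0
       THEN concat('"', concat_ws('","', indvprc), '"') END AS indvprc,
  CASE WHEN size(dscntamt) > 0
       THEN concat('"', concat_ws('","', dscntamt), '"') END AS dscntamt,
  CASE WHEN size(prdctsrjctd) > 0
       THEN concat('"', concat_ws('","', prdctsrjctd), '"') END AS prdctsrjctd
FROM db_test.user_arry;
```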
