I know there is no direct way to transpose data in Hive. I followed the question "Is there a way to transpose data in Hive?", but since there is no final answer there, I could not get all the way.
Here is my table:
| ID | Code | Proc1 | Proc2 |
| 1 | A | p | e |
| 2 | B | q | f |
| 3 | B | p | f |
| 3 | B | q | h |
| 3 | B | r | j |
| 3 | C | t | k |
Here Proc1 can have any number of values. ID, Code & Proc1 together form a unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 is the value in that column for the corresponding row. In essence, I'm trying to get something like this:
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |
In the newly transposed table, ID and Code together form the unique primary key. From the question I mentioned above, I could get this far using the to_map UDAF (a sketch of such a query follows the table below). (Disclaimer - this may not be a step in the right direction, but just mentioning it here in case it is.)
| ID | Code | Map_Aggregation |
| 1 | A | {p:e} |
| 2 | B | {q:f} |
| 3 | B | {p:f, q:h, r:j } |
| 3 | C | {t:k} |
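For context, a minimal sketch of the kind of query that could produce this intermediate result, assuming a map-building UDAF such as brickhouse's collect is registered (jar and class name taken from the answers below):
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
-- one map per (id, code), with each proc1 value as key and proc2 as value
select id, code, collect(proc1, proc2) as map_aggregation
from test_sample
group by id, code;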
But I don't know how to get from this step to the pivot/transposed table I want. Any help on how to proceed would be great! Thanks.
Here is the approach I used to solve this problem using Hive's internal UDF function "map":
select
b.id,
b.code,
concat_ws('',b.p) as p,
concat_ws('',b.q) as q,
concat_ws('',b.r) as r,
concat_ws('',b.t) as t
from
(
select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from (
select
id,
code,
map(proc1,proc2) as group_map
from
test_sample
) a
group by
a.id,
a.code
) b;
"concat_ws"one_answers"map"是配置单元udf,而"collect_list"是一个配置单元udaf。
Here is the solution I ended up using:
add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
select
id,
code,
group_map['p'] as p,
group_map['q'] as q,
group_map['r'] as r,
group_map['t'] as t
from ( select
id, code,
collect(proc1,proc2) as group_map
from test_sample
group by id, code
) gm;
The collect UDAF used above is from the brickhouse repo: https://github.com/klout/brickhouse
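Note that group_map['x'] returns NULL for keys that are absent, so the missing cells come back as NULL. If you want empty strings instead (as in the desired output above), one possible variation is to wrap each lookup in coalesce:
select
  id,
  code,
  coalesce(group_map['p'], '') as p,
  coalesce(group_map['q'], '') as q,
  coalesce(group_map['r'], '') as r,
  coalesce(group_map['t'], '') as t
from ( select
       id, code,
       collect(proc1, proc2) as group_map
       from test_sample
       group by id, code
     ) gm;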
Another solution: pivot using Hivemall's to_map function.
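Assuming Hivemall is available, to_map needs to be registered first; the jar name below is a placeholder and the class name follows Hivemall's function definitions, so treat both as assumptions:
add jar hivemall-with-dependencies.jar;  -- placeholder jar name, use your installed Hivemall jar
CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap';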
SELECT
uid,
kv['c1'] AS c1,
kv['c2'] AS c2,
kv['c3'] AS c3
FROM (
SELECT uid, to_map(key, value) kv
FROM vtable
GROUP BY uid
) t
uid c1 c2 c3
101 11 12 13
102 21 22 23
Unpivot
SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
'c1', c1,
'c2', c2,
'c3', c3
)) t2 as key, value
uid key value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23
I have not written this code, but I think you can use some of the UDFs provided by klout's brickhouse: https://github.com/klout/brickhouse
Specifically, you could use their collect as mentioned here: http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/
Then explode the arrays (they will be of differing lengths) using the method detailed in this post: http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_ra
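A minimal sketch of what that might look like on the test_sample table from the question; the collect, numeric_range and array_index functions are assumed to be registered from brickhouse as described in the two posts above (here the parallel explode simply reconstructs the original rows, which demonstrates the mechanics):
select id, code,
       array_index(procs, n) as proc1,
       array_index(vals, n)  as proc2
from (
  -- collect each column into an array per (id, code) group
  select id, code,
         collect(proc1) as procs,
         collect(proc2) as vals
  from test_sample
  group by id, code
) g
lateral view numeric_range(size(procs)) r as n;  -- n = 0 .. size-1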
- I created a dummy table named hive using the below query -
create table hive (id Int,Code String, Proc1 String, Proc2 String);
- Loaded all the data into the table -
insert into hive values('1','A','p','e');
insert into hive values('2','B','q','f');
insert into hive values('3','B','p','f');
insert into hive values('3','B','q','h');
insert into hive values('3','B','r','j');
insert into hive values('3','C','t','k');
- Now use the below query to achieve the output -
select id,code,
case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
from(
select id, code,
case when proc1 ='p' then proc2 end as p,
case when proc1 ='q' then proc2 end as q,
case when proc1 ='r' then proc2 end as r,
case when proc1 ='t' then proc2 end as t
from hive
) dummy group by id,code;
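Run against the sample rows loaded above, this should reproduce the pivoted layout from the question (missing cells come back as empty strings thanks to the outer CASE):
| ID | Code | p | q | r | t |
| 1 | A | e | | | |
| 2 | B | | f | | |
| 3 | B | f | h | j | |
| 3 | C | | | | k |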
For numeric values, you can use the below Hive query:
Sample data
ID cust_freq Var1 Var2 frequency
220444 1 16443 87128 72.10140547
312554 6 984 7339 0.342452643
220444 3 6201 87128 9.258396518
220444 6 47779 87128 2.831972441
312554 1 6055 7339 82.15209213
312554 3 12868 7339 4.478333954
220444 2 6705 87128 15.80822558
312554 2 37432 7339 13.02712127
select id, sum(a.group_map[1]) as One, sum(a.group_map[2]) as Two, sum(a.group_map[3]) as Three, sum(a.group_map[6]) as Six from
( select id,
map(cust_freq,frequency) as group_map
from table
) a group by a.id having id in
( '220444',
'312554');
ID one two three six
220444 72.10140547 15.80822558 9.258396518 2.831972441
312554 82.15209213 13.02712127 4.478333954 0.342452643
In the above example I haven't used any custom UDF; it uses only built-in Hive functions.
Note: for a string key, write the lookup as sum(a.group_map['1']) as One.
For Pivot, we can simply use the below logic.
SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);
Below is also a way to Unpivot:
SELECT TM1_Code, Product, Size, State_code, Description
, Promo_date
, Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
, MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;
You can achieve this using case statements and some help from collect_set. You can check out the detailed answer at http://www.analyticshut.com/big-data/hive/pivot-rows-to-columns-in-hive/
Here is the query for reference:
SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;