External table (Hive): selecting only a few columns from the file



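How can I create an external table that exposes only some of the columns in a file?

For example, the file has six columns: A, B, C, D, E, F. But in my table I only want A, C and F.

Is that possible?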

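I'm not aware of a way to selectively include columns from an HDFS file in an external table. Depending on your use case, it may be enough to define a view on top of the external table that includes only the columns you need. For example, given the following contrived external table: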

hive> CREATE EXTERNAL TABLE ext_table (
    >   A STRING,
    >   B STRING,
    >   C STRING,
    >   D STRING,
    >   E STRING,
    >   F STRING
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE
    > LOCATION '/tmp/ext_table';
OK
Time taken: 0.401 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B     row_1_col_C     row_1_col_D     row_1_col_E     row_1_col_F
row_2_col_A row_2_col_B     row_2_col_C     row_2_col_D     row_2_col_E     row_2_col_F
row_3_col_A row_3_col_B     row_3_col_C     row_3_col_D     row_3_col_E     row_3_col_F
Time taken: 0.222 seconds, Fetched: 3 row(s)

Then create a view that includes only the desired columns:

hive> CREATE VIEW filtered_ext_table AS SELECT A, C, F FROM ext_table;
OK
Time taken: 0.749 seconds
hive> DESCRIBE filtered_ext_table; 
OK
a                           string                              
c                           string                              
f                           string                              
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive> SELECT * FROM filtered_ext_table;
OK
row_1_col_A row_1_col_C     row_1_col_F
row_2_col_A row_2_col_C     row_2_col_F
row_3_col_A row_3_col_C     row_3_col_F
Time taken: 0.301 seconds, Fetched: 3 row(s)

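Another way to get what you want requires that you be able to modify the HDFS file backing the external table: if the columns you are interested in all sit at the beginning of each line, you can define the external table to capture only those leading columns (here the first 3), regardless of how many columns the file actually contains. To get A, C and F this way, the file would first have to be rewritten so that those columns come first. For example, using the same data file as above: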

hive> DROP TABLE IF EXISTS ext_table;
OK
Time taken: 1.438 seconds
hive> CREATE EXTERNAL TABLE ext_table (
    >   A STRING,
    >   B STRING,
    >   C STRING
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE
    > LOCATION '/tmp/ext_table';
OK
Time taken: 0.734 seconds
hive> SELECT * FROM ext_table;
OK
row_1_col_A row_1_col_B     row_1_col_C
row_2_col_A row_2_col_B     row_2_col_C
row_3_col_A row_3_col_B     row_3_col_C
Time taken: 0.727 seconds, Fetched: 3 row(s)
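
One further possibility, offered only as an untested sketch: Hive's RegexSerDe maps each regex capturing group to one table column, so a pattern that captures only fields 1, 3 and 6 of a comma-separated line should expose just A, C and F without touching the file. The table name ext_table_acf is hypothetical, and the pattern assumes every line has exactly six comma-separated fields with no embedded commas; lines that do not match the whole pattern come back as NULLs.

```sql
-- Sketch only: ext_table_acf is a hypothetical name; the data layout is assumed
-- to be the same six-column, comma-delimited file used above.
CREATE EXTERNAL TABLE ext_table_acf (
  A STRING,
  C STRING,
  F STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Three capturing groups (fields 1, 3, 6); fields 2, 4, 5 are matched but not captured.
  "input.regex" = "([^,]*),[^,]*,([^,]*),[^,]*,[^,]*,([^,]*)"
)
STORED AS TEXTFILE
LOCATION '/tmp/ext_table';
```

Regex parsing is generally slower than the plain delimited format, so the view approach may still be preferable for large files.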

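I found the answer here (note that this is Oracle external-table DDL using ORACLE_LOADER, not HiveQL; the unwanted leading fields are consumed by the DUMMY_1 and DUMMY_2 placeholders in the field list):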

create table tmpdc_ticket(
    SERVICE_ID CHAR(144),
    SERVICE_TYPE CHAR(50),
    CUSTOMER_NAME CHAR(200),
    TELEPHONE_NO CHAR(144),
    ACCOUNT_NUMBER CHAR(144),
    FAULT_STATUS CHAR(50),
    BUSINESS_GROUP CHAR(100)
)
organization external(
    type    oracle_loader
    default directory sample_directory
    access parameters(
        records delimited by newline
        nologfile
        skip 1
        fields terminated by '|'
        missing field values are null
            (DUMMY_1,
             DUMMY_2,
             SERVICE_ID CHAR(144),
             SERVICE_TYPE CHAR(50),
             CUSTOMER_NAME CHAR(200),
             TELEPHONE_NO CHAR(144),
             ACCOUNT_NUMBER CHAR(144),
             FAULT_STATUS CHAR(50),
             BUSINESS_GROUP CHAR(100)
        )
    )
    location(sample_directory:'sample_file.txt')
)
reject limit 1
noparallel
nomonitoring;
