我有一个csv文件,如下所述。
VTS,51,0071,9739965515,NM,GP,INF01,V,19,072219,291014,0000.0000,N,00000.0000,E,07AE
VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,169,B205
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072311,291014,0000.0000,N,00000.0000,E,C24E
VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,171,B358
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072319,291014,0000.0000,N,00000.0000,E,012F
VTS,51,0071,9739965515,NM,GP,INF01,V,19,072326,291014,0000.0000,N,00000.0000,E,B2E6
VTS,01,0097,9739965515,NM,GP,18,072326,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,173,EAA0
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072333,291014,0000.0000,N,00000.0000,E,9896
VTS,51,0071,9739965515,NM,GP,INF01,V,18,072340,291014,0000.0000,N,00000.0000,E,9B23
这必须用字段映射:
pkt_header,gprs_pkt_id,pkt_length,sim_no,msg_id,gprs_pkt,gsm_sig_strength,utc_time,pkt_validation,latitude,direction_n_s,longitude,direction_e_w,speed,track_angle,utc_date,fuel_adc_values,ignition,odometer_values,supply_int,battery_adc,pkt_id,check_sum
第二个字段,即值为01的gprs_pkt_id描述了一个有效的分组。我的用例是只为有效的数据包过滤csv数据,对此我使用regex,但我无法获得整个数据。任何帮助都将不胜感激。
使用的配置单元查询如下所示。
CREATE EXTERNAL TABlE sky_track_testing1(
pkt_header STRING,
gprs_pkt_id STRING,
pkt_length STRING,
sim_no STRING,
msg_id STRING,
gprs_pkt STRING,
gsm_sig_strength STRING,
utc_time STRING,
pkt_validation STRING,
latitude STRING,
direction_n_s STRING,
longitude STRING,
direction_e_w STRING,
speed STRING,
track_angle STRING,
utc_date STRING,
fuel_adc_values STRING,
ignition STRING,
odometer_values STRING,
supply_int STRING,
battery_adc STRING,
pkt_id STRING,
check_sum STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(VTS,01).*$"
)
STORED AS TEXTFILE
LOCATION '/user/root/sky_track';
这绝对是一个错误的查询。请帮帮我。
我建议您使用Pig:
a = load '/user/root/sky_track' as (pkt_header,gprs_pkt_id,pkt_length,sim_no,msg_id,gprs_pkt,gsm_sig_strength,utc_time,pkt_validation,latitude,direction_n_s,longitude,direction_e_w,speed,track_angle,utc_date,fuel_adc_values,ignition,odometer_values,supply_int,battery_adc,pkt_id,check_sum);
b = filter a by gprs_pkt_id == '01';
store b into '/user/root/sky_track_valid';
是的,根据上面的答案,Pig将非常适合您的数据。你可以试试猪。如果您对hive感兴趣,请参阅下面的示例(您的数据集不需要regex)。
hive> CREATE TABLE sky_track_testing1(
> pkt_header STRING,
> gprs_pkt_id STRING,
> pkt_length STRING,
> sim_no STRING,
> msg_id STRING,
> gprs_pkt STRING,
> gsm_sig_strength STRING,
> utc_time STRING,
> pkt_validation STRING,
> latitude STRING,
> direction_n_s STRING,
> longitude STRING,
> direction_e_w STRING,
> speed STRING,
> track_angle STRING,
> utc_date STRING,
> fuel_adc_values STRING,
> ignition STRING,
> odometer_values STRING,
> supply_int STRING,
> battery_adc STRING,
> pkt_id STRING,
> check_sum STRING
> )
> ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ','
> LINES TERMINATED BY 'n'
> STORED AS TEXTFILE;
OK
Time taken: 0.1 seconds
hive> select *from sky_track_testing1 where gprs_pkt_id='01';
OK
VTS 01 0097 9739965515 SP GP 18 072253 V 0000.0000 N 00000.0000 E 0.0 0.0 291014 0000 00 4000 1999 169 B205
VTS 01 0097 9739965515 NM GP 19 072311 V 0000.0000 N 00000.0000 E 0.0 0.0 291014 0000 00 4000 1999 171 B358
VTS 01 0097 9739965515 NM GP 18 072326 V 0000.0000 N 00000.0000 E 0.0 0.0 291014 0000 00 4000 1999 173 EAA0
Time taken: 14.328 seconds