Importing a JSON array into Hive



I am trying to import the following JSON into Hive:

[{"time":1521115600,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115659,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115720,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115779,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]

CREATE TABLE json_serde (
s array<struct<time: timestamp, latitude: string, longitude: string, pm1: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE
location '/user/hduser';

The import works, but if I try

Select * from json_serde;

it returns only the first element of each file under /user/hduser in Hadoop.

Is there any good documentation on working with JSON arrays?

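If I may suggest another approach: load the whole JSON string into a single String column of an external table. The only constraint is that LINES TERMINATED BY must be defined correctly. For example, if you can put each JSON document on a single line, you can create a table like this: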

CREATE EXTERNAL TABLE json_data_table (
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';

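Use Hive's get_json_object to extract individual columns. It supports a basic XPath-like path syntax for querying a JSON string.

For example, if the json_data column holds the following JSON string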

{"store":
{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}

the following query

SELECT get_json_object(json_data, '$.owner') FROM json_data_table;

returns amy.

In this way you can extract each JSON element from the table as a column.
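As a rough sketch of how far the path syntax goes, the queries below reuse the sample JSON string and the json_data_table table from this answer. Nested keys and array indexes follow the pattern given in the Hive documentation for get_json_object. json_tuple is a related built-in that can be handy when several top-level keys are needed at once. The column aliases are illustrative only.

```sql
-- Nested object: walk down with dots.
SELECT get_json_object(json_data, '$.store.bicycle.price') AS bicycle_price
FROM json_data_table;

-- Array element: index with [n], then continue the path.
SELECT get_json_object(json_data, '$.store.fruit[0].type') AS first_fruit_type
FROM json_data_table;

-- Several top-level keys in one pass with json_tuple.
SELECT t.owner, t.email
FROM json_data_table
LATERAL VIEW json_tuple(json_data, 'owner', 'email') t AS owner, email;
```

get_json_object returns a string (or NULL when the path does not match), so numeric values usually need a CAST before any arithmetic.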

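You have an array of structs, and what you pasted is only one line, so the whole array is a single row.

If you want to see all the elements, you need to use inline: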

SELECT inline(s) FROM json_serde;

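Alternatively, rewrite the file so that each object in the array is a single JSON object on its own line of the file.

Also, I don't see a value field in your data, so I'm not sure what you are mapping in the SerDe properties.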
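For the second option (one JSON object per line), a table definition could look roughly like the sketch below. The table name sensor_readings and its location are hypothetical, only four of the fields are declared, and it is assumed that the SerDe skips JSON keys that have no matching column. The 'mapping.value' property is left out because the data has no such field.

```sql
-- Hypothetical table over a file in which every line is one JSON object.
CREATE EXTERNAL TABLE sensor_readings (
  `time`    BIGINT,   -- epoch value kept as a number; backticks guard against keyword clashes
  latitude  DOUBLE,
  longitude DOUBLE,
  pm1       DOUBLE
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/hduser/sensor_lines';  -- assumed path, not from the question

-- Each line is now its own row, so inline() is no longer needed.
SELECT `time`, latitude, longitude, pm1
FROM sensor_readings;
```

Keeping the measurements as DOUBLE rather than string is a design choice of this sketch; the original table declared them as string.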

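The JSON you provided does not match what the table expects. A top-level array is valid JSON in itself, but for the SerDe to map the array to a named column, each record should be a JSON object that starts with an opening brace "{" and ends with a closing brace "}", with the array stored under a key. So the first thing to fix is the shape of your JSON.

Your JSON should look like this: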

{"key":[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"}]}

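Second, you declared the "time" field as a timestamp, but the data (1521115600) is a numeric Unix epoch value. It reads most naturally as seconds (15 March 2018 UTC); interpreted as milliseconds it falls in January 1970, which is where the 1970-01-18 values below come from. The timestamp data type expects data in the format YYYY-MM-DD HH:MM:SS[.ffffff].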
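To see what the number corresponds to as a formatted value, Hive's built-in from_unixtime can be tried. It takes epoch seconds and returns a string in the yyyy-MM-dd HH:mm:ss format. This is only a sketch: the result depends on the session time zone, and the column aliases are illustrative.

```sql
-- from_unixtime expects seconds since 1970-01-01 00:00:00 UTC.
SELECT from_unixtime(1521115600)                    AS ts_string,
       CAST(from_unixtime(1521115600) AS TIMESTAMP) AS ts;
```

In a session running in UTC this should give 2018-03-15 12:06:40. The same conversion can be applied at query time if the column is kept as a numeric type instead of being rewritten in the file.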

So ideally, your data should be in the following format:

{"myjson":[{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]}

Now you can select records from the table with a query.

hive> select * from json_serde;
OK
[{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"21.70905"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"24.34045"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"23.6826"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"25.65615"}]
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive>

If you want each value shown separately in tabular form, you can use the following query.

select b.* from json_serde a lateral view outer inline (a.myjson) b;

The result of the above query looks like this:

+------------------------+-------------+--------------+-----------+--+
|         b.time         | b.latitude  | b.longitude  |   b.pm1   |
+------------------------+-------------+--------------+-----------+--+
| 1970-01-18 20:01:55.0  | 44.3959     | 26.1025      | 21.70905  |
| 1970-01-18 20:01:55.0  | 44.3959     | 26.1025      | 24.34045  |
| 1970-01-18 20:01:55.0  | 44.3959     | 26.1025      | 23.6826   |
| 1970-01-18 20:01:55.0  | 44.3959     | 26.1025      | 25.65615  |
+------------------------+-------------+--------------+-----------+--+

Beautiful, isn't it?

Happy learning.

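If you cannot change the input file format, you can load the file directly in Spark, work with it there, and write the result back to a Hive table once the data is ready.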

scala> val myjs = spark.read.format("json").option("path","file:///root/tmp/test5").load()
myjs: org.apache.spark.sql.DataFrame = [altitude: bigint, gas1: bigint ... 13 more fields]
scala> myjs.show()
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|altitude|gas1|gas2|gas3|gas4|humidity|latitude|longitude|noise|     pm1|    pm10|pm25|pressure|temperature|      time|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|      53|   0|0.12|   0|   0|       0| 44.3959|  26.1025|    0|21.70905|14.60085|16.5|       0|       null|1521115600|
|      53|   0|0.08|   0|   0|       0| 44.3959|  26.1025|    0|24.34045|16.37065|18.5|       0|       null|1521115659|
|      53|   0| 0.0|   0|   0|       0| 44.3959|  26.1025|    0| 23.6826| 15.9282|18.0|       0|       null|1521115720|
|      53|   0|0.04|   0|   0|       0| 44.3959|  26.1025|    0|25.65615|17.25555|19.5|       0|       null|1521115779|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+

scala> myjs.write.json("file:///root/tmp/test_output")

Or you can write it directly to a Hive table:

scala> myjs.createOrReplaceTempView("myjs")
scala> spark.sql("select * from myjs").show()
scala> spark.sql("create table tax.myjs_hive as select * from myjs")
