HDP-2.5.0.0使用Ambari 2.4.0.1
配置单元表ReportSetting如下所示:
id int
serializedreportsetting String
"serializedreportsetting"列是源SQL Server数据库中的XML数据类型,但在Sqoop导入过程中转换为String,这就是它在SQL Server中的样子:
<ReportSettings4 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Trigger>
<Manual>true</Manual>
</Trigger>
<StartTime>
<Year>8</Year>
<Month>1</Month>
<Day>1</Day>
<Hour>0</Hour>
<Minute>0</Minute>
</StartTime>
<ReportPeriod>
<Month>0</Month>
<Day>0</Day>
<Hour>0</Hour>
<Minute>5</Minute>
</ReportPeriod>
<Theft>
<DigitalInput>true</DigitalInput>
<Can>false</Can>
</Theft>
<SequenceNo>0</SequenceNo>
</ReportSettings4>
在配置单元表中:
<ReportSettings4 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Trigger><Manual>true</Manual></Trigger><StartTime><Year>8</Year><Month>12</Month><Day>31</Day><Hour>23</Hour><Minute>34</Minute></StartTime><ReportPeriod><Month>0</Month><Day>0</Day><Hour>4</Hour><Minute>0</Minute></ReportPeriod><Theft><DigitalInput>false</DigitalInput><Can>false</Can></Theft><SequenceNo>3</SequenceNo></ReportSettings4>
在SQL Server上运行良好的查询:
SELECT
r.VehicleId
,rs.value('(Trigger/Manual)[1]', 'bit') AS RS_Trigger_Manual, ,CAST(CONCAT(CASE WHEN rs.value('(StartTime/Year)[1]', 'int') < 10 THEN CONCAT('200',rs.value('(StartTime/Year)[1]', 'int')) ELSE CONCAT('20',rs.value('(StartTime/Year)[1]', 'int')) END,'-',rs.value('(StartTime/Month)[1]', 'int'),'-',rs.value('(StartTime/Day)[1]', 'int'),' ',rs.value('(StartTime/Hour)[1]', 'int'),':',rs.value('(StartTime/Minute)[1]', 'int'),':','00.000') AS datetime) AS RS_StartTime
,rs.value('(ReportPeriod/Month)[1]', 'int') AS RS_ReportPeriod_Month
,rs.value('(ReportPeriod/Day)[1]', 'int') AS RS_ReportPeriod_Day
,rs.value('(ReportPeriod/Hour)[1]', 'int') AS RS_ReportPeriod_Hour
,rs.value('(ReportPeriod/Minute)[1]', 'int') AS RS_ReportPeriod_Minute
,rs.value('(Theft/DigitalInput)[1]', 'bit') AS RS_Theft_DigitalInput
,rs.value('(Theft/Can)[1]', 'bit') AS RS_Theft_Can,rs.value('(SequenceNo)[1]', 'int')
AS RS_SequenceNo FROM ReportSetting r
CROSS APPLY SerializedReportSetting.nodes('/*') AS ReportSettings(rs)
我可以考虑/做以下事情:
- 要使用CROSS APPLY,我想需要使用横向视图,这里我没有将serializedportsetting作为数组,所以explode((不起作用。有人能验证我是否朝着正确的方向思考吗
我只是尝试使用内置的xpath-udf将serializedportsetting中的数据获取为列,然而,我没有得到任何记录,一些尝试如下:
从ReportSetting限制1中选择xpath(SerializedReportSetting,'/*'(;
从ReportSetting限制1中选择xpath(SerializedReportSetting,'/ReportSettings4'(;
从ReportSetting限制1中选择xpath(SerializedReportSetting,'/Trigger/Manual'(;
**********更新-1**********
我使用regexp_replace来处理上述挑战:
SELECT id,
xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/Manual') AS RS_Trigger_Manual,
xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/DriveChange') AS RS_Trigger_DriveChange
FROM
(SELECT id,
regexp_replace(SerializedReportSetting, 'ReportSettings+\d','ReportSettings') AS SerializedReportSetting
FROM reportsetting
WHERE id IN (1701548,3185,1700231,1700232)
) reportsetting_regex;
在xpath中,他们明确表示:
xpath()
函数总是返回一个字符串的配置单元数组。如果表达式产生非文本值(例如,另一个xml节点(,则函数将返回一个空数组
因此,您可以使用:xpath(SerializedReportSetting,'/ReportSettings4/Trigger/Manual/text()') from ReportSetting limit 1;
或者更好的选择是使用xpath_boolean
/xpath_int
:
xpath_boolean-如果XPath表达式的计算结果为true,或者找到匹配的节点,则返回true。
xpath_boolean(SerializedReportSetting,'/ReportSettings4/Trigger/Manual') from ReportSetting limit 1;
xpath_short、xpath_int、xpath_long这些函数返回一个整数值,如果未找到匹配项,则返回值零,或者找到匹配项但该值不是数字。支持数学运算。如果值溢出返回类型,则返回该类型的最大值。
xpath_int(SerializedReportSetting,'/ReportSettings4/ReportPeriod/Month') from ReportSetting limit 1;