实际上不是将任意 XML 导入 SQL Server 的副本
我的公司有 20 GB 的 XML 文件,他们想要对这些文件进行一些数据挖掘。他们将使用的分析工具是SAS,我从未使用过 - 其他人将进行实际挖掘。我的工作是找到一种方法将XML文件转换为相对扁平的数据结构,以便将它们导入SAS。这些文件在六年的时间里来自六个不同的来源。虽然它们名义上都描述了同一件事 - 信用查询的(非常详细的)结果 - 但它们并不遵循一致的格式,即使文件来自同一来源,因为文档的版本随着时间的推移发生了显着变化。没有可用的 XSL、XSD 或 XSLT 文档。
答案似乎是"你想要一个文档数据库",但显然SAS需要一些扁平的东西,比如CSV或其他宽表结构,或者一些关系的东西。我的经验主要是在SQL Server上,但如果有针对其他平台的解决方案,我们绝对对此持开放态度。我们甚至研究过使用 Microsoft Excel,但它没有正确解释文件(它解析得很好,但它给列起了无意义的名称)。
我考虑了编写 C# 代码以基于 XML 数据生成 SQL 模式的想法,并希望至少在单个源的范围内,结构可以足够一致以适合所有文件。我已经研究了使用 SQLXML 大容量加载来生成表,但这需要一个 SQL 注释的 XSD 架构,并且似乎没有任何工具来生成它。
我们已经研究了如何使用 xml 数据类型方法将数据放入如下所示的表中:
CREATE TABLE ResponseData
(
CustomerID INT,
NodePath VARCHAR(500),
Position SMALLINT,
Value VARCHAR(500)
)
但觉得一定有办法获得比这更有用的信息分离。
关于如何将已知的 XML 文档转换为 SQL 的信息(包括几个 SO 问题),有很多信息,但我需要知道如何导入任意 XML 文档。"导入具有未知结构的XML"出现了一些建议的工具,但它们的输出并没有那么有用。
任何帮助将不胜感激!
这可能是显而易见的,但我认为您必须从打开不同时间段的几个文件开始,并尝试了解您正在处理多少"模式"(在 XML 意义上)。 然后,您可以编写一些代码来系统地读取文件,尝试识别它们的"架构"并记录与任何已知类型都不匹配的文件。 目标是弄清楚您真正拥有多少种类型的文档;之后,您可以担心如何将它们放入数据库中,一次一种类型,希望确定可以完全表示任何类型的文档的单个数据库模式。 我知道我在这里没有说太多技术问题,但我认为你现在面临的是一个战略问题,而不是一个技术问题。
我会在这里列出这些,即使它们对您没有帮助,它们也可能会在将来寻找类似解决方案时帮助其他人。
我们在 SAS 中使用以下两个宏从 XML 信用查询中提取某些属性、元素、值等。 我提供了一些例子,希望这些例子能帮助解释它们是如何工作的。 我现在没有时间详细讨论它,但想提供一些您可能会发现有用的东西。 如果您将这些提供给分析师,他们应该能够按原样运行代码,并完成示例和参数,为自己提取一些信息以进行初步调查,并可能为您提供更具体的要求。
以下宏的唯一条件是 XML 不超过 32767 个字符,并且它全部位于 SAS 中单行的单个字符观察中(即不存储在 SAS 中的多个观察中)。
他们不应该真正需要了解宏的工作原理,他们只需要了解如何调用和使用它们。
/*****************************************************************************
** PROGRAM: MACROS.PRXCOUNT.SAS
**
** RETURNS THE NUMBER OF TIMES A SEGMENT IS FOUND IN AN XML FILE.
**
** PARAMETERS:
** iElement : The element to search through the blob for.
** iXMLField : The name of the field that contains the XML blob to parse.
** iDelimiterType: (1 or 2). Defaults to 1. 1 USES <> AS DELIMS. 2 USES [].
**
******************************************************************************
** HISTORY:
** 1.0 MODIFIED: 25-FEB-2011 BY:RP
** - CREATED.
** 1.1 MODIFIED: 14-MAR-2011 BY:RP
** - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
*****************************************************************************/
%macro prxCount(iElement=, iXMLField=, iDelimiterType=1);
%local delim_open delim_close;
crLf = byte(10) || byte(13);
&iXMLField = compress(&iXMLField,crLf,);
%if &iDelimiterType eq 1 %then %do;
%let delim_open = <;
%let delim_close = >;
%end;
%else %if &iDelimiterType eq 2 %then %do;
%let delim_open = [;
%let delim_close = ];
%end;
%else %if &iDelimiterType eq 3 %then %do;
%let delim_open = %nrbquote(&)lt%quote(%str(;)) ;
%let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
%end;
%else %do;
%put ERR%str()ROR (prxCount.sas): You specified an incorrect option for the iDelimiterType parameter.;
%end;
prx_id = prxparse("/&delim_open&iElement(s+.*?&delim_close|&delim_close){1}?(.*?)&delim_open/&iElement&delim_close/i");
prx_count = 0;
prx_start = 1;
prx_stop = length(&iXMLField);
call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
do while (prx_pos > 0);
prx_count = prx_count + 1;
call prxposn(prx_id, 1, prx_pos, prx_length);
call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
end;
drop crLf prx_:;
%mend;
/*****************************************************************************
** PROGRAM: PRXEXTRACT.SAS
**
** SEARCHES THROUGH AN XML (OR HTML) FILE FOR AN ELEMENT AND EXTRACTS THE
** VALUE BETWEEN AN ELEMENTS TAGS.
**
** PARAMETERS:
** iElement : The element to search through the blob for.
** iField : The fieldname to save the result to.
** iType : (N or C) for Numeric or Character.
** iLength : The length of the field to create.
** iXMLField : The name of the field that contains the XML blob to parse.
** iDelimiterType: (1 or 2). Defaults to 1. 1 USES <> AS DELIMS. 2 USES [].
**
******************************************************************************
** HISTORY:
** 1.0 MODIFIED: 14-FEB-2011 BY:RP
** - CREATED.
** 1.1 MODIFIED: 16-FEB-2011 BY:RP
** - ADDED OPTION TO CHANGE DELIMITERS FROM <> TO []
** 1.1 MODIFIED: 17-FEB-2011 BY:RP
** - CORRECTED ERROR WHEN MATCH RETURNS A LENGTH OF ZERO
** - CORRECTED MISSING AMPERSAND FROM IDELIMITERTYPE CHECK.
** - ADDED ESCAPING QUOTES TO [] DELIMITER TYPE
** - CORRECTED WARNING WHEN MATCH RETURNS MISSING NUMERIC FIELD
** 1.2 MODIFIED: 25-FEB-2011 BY:RP
** - ADDED DELIMITER TYPES TO WORK WITH MASKED HTML CODES
** 1.3 MODIFIED: 11-MAR-2011 BY:RP
** - MODIFIED TO ALLOW FOR OPTIONAL ATTRIBUTES ON THE ELEMENT BEING SEARCHED FOR.
** 1.4 MODIFIED: 14-MAR-2011 BY:RP
** - CORRECTED TO REMOVE FALSE MATCHES FROM PRIOR VERSION. ADDED EXAMPLE.
** 1.5 MODIFIED: 10-APR-2012 BY:RP
** - CORRECTED PROBLEM WITH ZERO LENGTH STRING MATCHES
** 1.6 MODIFIED: 22-MAY-2012 BY:RP
** - ADDED ABILITY TO CAPTURE ATTRIBUTES
*****************************************************************************/
%macro prxExtract(iElement=, iField=, iType=, iLength=, iXMLField=, iDelimiterType=1, iSequence=1, iAttributesField=);
%local delim_open delim_close;
crLf = byte(10) || byte(13);
&iXMLField = compress(&iXMLField,crLf,);
%if &iDelimiterType eq 1 %then %do;
%let delim_open = <;
%let delim_close = >;
%end;
%else %if &iDelimiterType eq 2 %then %do;
%let delim_open = [;
%let delim_close = ];
%end;
%else %if &iDelimiterType eq 3 %then %do;
%let delim_open = %nrbquote(&)lt%quote(%str(;)) ;
%let delim_close = %nrbquote(&)gt%quote(%str(;)) ;
%end;
%else %do;
%put ERR%str()ROR (prxExtract.sas): You specified an incorrect option for the iDelimiterType parameter.;
%end;
%if %sysfunc(index(&iField,[)) %then %do;
/* DONT DO THIS IF ITS AN ARRAY */
%end;
%else %do;
%if "%upcase(&iType)" eq "N" %then %do;
attrib &iField length=&iLength format=best.;
%end;
%else %do;
attrib &iField length=$&iLength format=$&iLength..;
%end;
%end;
/*
** BREAKDOWN OF REGULAR EXPRESSION (EXAMPLE USES < AND > AS DELIMS AND ANI AS THE ELEMENT BEING LOOKED FOR:
**
** &delim_open&iElement --> FINDS <ANI
** (s+.*?&delim_close|&delim_close){1}? --> FINDS THE SHORTEST SINGLE INSTANCE OF EITHER:
** - ONE OR MORE SPACES FOLLOWED BY ANYTHING UNTIL A > CHARACTER
** - OR JUST A > CHARACTER
** THE ?: JUST TELLS IT NOT TO CAPTURE WHAT IT FOUND INBETWEEN THE ( AND )
** (.*?) --> FINDS WHAT WE ARE SEARCHING FOR AND CAPTURES IT INTO BUFFER 1.
** &delim_open --> FINDS <
** / --> FINDS THE / CHARACTER. THE FIRST SLASH ESCAPES IT SO IT KNOWS ITS NOT A SPECIAL REGEX SLASH
** &iElement&delim_close --> FINDS ANI>
*/
prx_id = prxparse("/&delim_open&iElement((s+.*?)&delim_close|&delim_close){1}?(.*?)&delim_open/&iElement&delim_close/i");
prx_start = 1;
prx_stop = length(&iXMLField);
prx_sequence = 0;
call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
do while (prx_pos > 0);
prx_sequence = prx_sequence + 1;
if prx_sequence = &iSequence then do;
if prx_length > 0 then do;
call prxposn(prx_id, 3, prx_pos, prx_length);
%if "%upcase(&iType)" eq "N" %then %do;
length prx_tmp_n $200;
prx_tmp_n = substr(&iXMLField, prx_pos, prx_length);
if cats(prx_tmp_n) ne "" then do;
&iField = input(substr(&iXMLField, prx_pos, prx_length), ?best.);
end;
%end;
%else %do;
if prx_length ne 0 then do;
&iField = substr(&iXMLField, prx_pos, prx_length);
end;
else do;
&iField = "";
end;
%end;
**
** ALSO SAVE THE ATTRIBUTES TO A FIELD IF REQUESTED
*;
%if "%upcase(&iAttributesField)" ne "" %then %do;
call prxposn(prx_id, 2, prx_pos, prx_length);
if prx_length ne 0 then do;
&iAttributesField = substr(&iXMLField, prx_pos, prx_length);
end;
else do;
&iAttributesField = "";
end;
%end;
end;
end;
call prxnext(prx_id, prx_start, prx_stop, &iXMLField, prx_pos, prx_length);
end;
drop crLf prx:;
%mend;
单个元素的示例:
data example;
xml = "<test><ANI2Digits>00</ANI2Digits><XNI xniattrib=1>7606256091</XNI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;
%prxExtract(iElement=xni, iField=my_xni, iType=c, iLength=15, iXMLField=xml, iSequence=1, iAttributesField=my_xni_attribs);
run;
重复元素的示例:
data example;
xml = "<test><ANI2Digits>00</ANI2Digits><ANI>7606256091</ANI><ANI>number2</ANI><ANI x=hmm y=yay>number3</ANI></test>"; * NOTE THE XML MUST BE ALL ON ONE LINE;
%prxExtract(iElement=ani2digits, iField=ani2digits, iType=c, iLength=50, iXMLField=xml);
length ani1-ani6 $15;
length attr1-attr6 $100;
array arrani [1:6] $ ani1-ani6;
array arrattr [1:6] $ attr1-attr6;
%prxCount (iElement=ani, iXMLField=xml, iDelimiterType=1);
do cnt=1 to prx_count;
%prxExtract(iElement=ani, iField=arrani[cnt], iType=c, iLength=15, iXMLField=xml, iSequence=cnt, iAttributesField=arrattr[cnt]);
end;
run;