在SQL Server 2016中解析大型表的HTML数据



我试图解析存在于我的表之一的HTML数据,但问题是表的大小是巨大的,我不能使用这里提供的功能。我只是想知道是否有任何有效的方法来解析数据从一个巨大的表。

我在这里尝试了解决方案,它工作得很好,但不能用于巨大的表。https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

示例代码如下:http://sqlfiddle.com/#!18/27baba/1

数据如下:

ID  Valuen
1 <t><some>this is the value</t></some>another value . <none> 3rd value </none>
2 <t><bold>Correct value </t></bold>new value . <color> salary </color>

样本输出:

ID     Valuen
1     this is the value another value 3rd value
2     Correct value  new value . salary 

请注意,我尝试使用XML解析,以及,但它不工作,因为它无法将所有的数据转换为XML。

任何帮助都将不胜感激,谢谢。

打开辅助函数

dbFiddle的规范化示例

Select A.ID
,B.*
From  test A
Cross Apply   [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<') B

ID  RetSeq  RetVal
1   1       this is the value
1   2       another value .
1   3       3rd value
2   1       Correct value
2   2       new value .
2   3       salary

同上,只是增加了一个string_agg()

Select A.ID
,B.*
From  test A
Cross Apply  ( Select NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq) 
From [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<')  
)B

结果

ID  NewVal
1   this is the value another value .   3rd value
2   Correct value  new value .   salary

感兴趣的函数

CREATE FUNCTION [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
Returns Table 
As
Return (  
Select RetSeq = row_number() over (order by RetSeq)
,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
From  (
Select RetSeq = [Key]+1
,RetVal = trim(Value)
From  OpenJSON( N'["'+replace(string_escape(@String,'json'),@Delim1,'","')+N'"]' )
) C1
Where charindex(@Delim2,RetVal)>1
)

请在您的数据集上尝试以下解决方案。

/p>

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500)); 
INSERT INTO @tbl(id, tokens) VALUES 
(1, '<t><some>this is the value</t></some>another value .  <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value .  <color> salary </color>');
-- DDL and sample data population, end
SELECT t.ID
, Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
REPLACE(tokens, '>', ']]></r><r><![CDATA[') + 
']]></r></root>' AS XML)) AS t1(x)
CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;

<表类>ID结果1这是另一个值。第三个值2正确值新值。工资

最新更新