我试图解析存在于我的表之一的HTML数据,但问题是表的大小是巨大的,我不能使用这里提供的功能。我只是想知道是否有任何有效的方法来解析数据从一个巨大的表。
我在这里尝试了解决方案,它工作得很好,但不能用于巨大的表。https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
示例代码如下:http://sqlfiddle.com/#!18/27baba/1
数据如下:
ID Valuen
1 <t><some>this is the value</t></some>another value . <none> 3rd value </none>
2 <t><bold>Correct value </t></bold>new value . <color> salary </color>
样本输出:
ID Valuen
1 this is the value another value 3rd value
2 Correct value new value . salary
请注意,我尝试使用XML解析,以及,但它不工作,因为它无法将所有的数据转换为XML。
任何帮助都将不胜感激,谢谢。
打开辅助函数
dbFiddle的规范化示例
Select A.ID
,B.*
From test A
Cross Apply [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<') B
ID RetSeq RetVal
1 1 this is the value
1 2 another value .
1 3 3rd value
2 1 Correct value
2 2 new value .
2 3 salary
同上,只是增加了一个string_agg()
Select A.ID
,B.*
From test A
Cross Apply ( Select NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq)
From [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<')
)B
结果
ID NewVal
1 this is the value another value . 3rd value
2 Correct value new value . salary
感兴趣的函数
CREATE FUNCTION [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
Returns Table
As
Return (
Select RetSeq = row_number() over (order by RetSeq)
,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
From (
Select RetSeq = [Key]+1
,RetVal = trim(Value)
From OpenJSON( N'["'+replace(string_escape(@String,'json'),@Delim1,'","')+N'"]' )
) C1
Where charindex(@Delim2,RetVal)>1
)
请在您的数据集上尝试以下解决方案。
/p>-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500));
INSERT INTO @tbl(id, tokens) VALUES
(1, '<t><some>this is the value</t></some>another value . <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value . <color> salary </color>');
-- DDL and sample data population, end
SELECT t.ID
, Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' +
REPLACE(tokens, '>', ']]></r><r><![CDATA[') +
']]></r></root>' AS XML)) AS t1(x)
CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;
ID 结果 1 这是另一个值。第三个值 表类>2 正确值新值。工资