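I have some XML files, from about 50 MB up to 2 GB in size, containing tens or hundreds of thousands of mycomment elements that have only text nodes. The path to the mycomment nodes is neither fixed nor defined, so //mycomment is the only way to reach all of them. The length of mycomment/text() is about 50 to 500 characters. I need to search all mycomment text nodes for a pattern in order to classify the files. If the pattern " mypattern1234 " is found in one of the text nodes, the variable hit is set to 1, otherwise it is empty. Is calculating hit like this a good solution: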
<xsl:variable name="hit">
<xsl:if test="//mycomment[contains(text(),' mypattern1234 ')]">1</xsl:if>
</xsl:variable>
:-( I am using XSLT v1.0. Many thanks.
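You will not be able to process a 2 GB input document unless you switch to a streaming XSLT 3.0 processor (such as Saxon-EE).
If you are using a streaming processor, then I would suggest writing it as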
<xsl:source-document href="input.xml" streamable="yes">
  <xsl:if test="//text()[parent::mycomment][contains(., ' mypattern1234 ')]">1</xsl:if>
</xsl:source-document>
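This looks only at text nodes and then checks their context, rather than matching element nodes and then setting up a search of their child text nodes, which saves a little overhead.
I would expect a 2 GB search to run within about a minute, depending on your hardware. However, the scan of the source document should stop as soon as a match is found, so it will be much faster if a match occurs near the start of the document.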
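It is not clear whether mycomment elements can have mixed content, or text nodes interleaved with comments or processing instructions. In general, if you expect mycomment elements to have text content only, I would check mycomment[contains(., ' foo ')]; there is no need to select down to the text node children. If you do want that, I would use mycomment/text()[contains(., ' foo ')]. Your check, with text() as the argument to contains, tests only the first text child node (in XPath 1.0, converting a node set to a string uses only its first node), so in e.g. <mycomment>foo <!-- a comment --> mypattern1234 </mycomment> the pattern would not be detected.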
As for efficiency, that depends largely on the XSLT processor used.
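The difference between the three checks discussed in this answer can be reproduced on the sample element with an XPath 1.0 engine. The following is a minimal sketch, assuming xmllint from libxml2 is installed; the helper name xpath_bool, the root wrapper element and the temporary file are illustrative and not part of the original question.

```shell
#!/bin/sh
# Sketch: evaluate the three XPath 1.0 checks on the sample element.
# Assumption: xmllint (libxml2, an XPath 1.0 engine) is on the PATH.

sample=$(mktemp)
printf '%s\n' '<root><mycomment>foo <!-- a comment --> mypattern1234 </mycomment></root>' > "$sample"

# xpath_bool EXPR: hypothetical helper, prints "true" or "false" for EXPR
xpath_bool() {
    xmllint --xpath "boolean($1)" "$sample"
}

if command -v xmllint >/dev/null 2>&1; then
    # text() as the argument: only the first text child ("foo ") is tested
    first_only=$(xpath_bool "//mycomment[contains(text(), ' mypattern1234 ')]")
    # string value of the element: all descendant text nodes concatenated
    string_value=$(xpath_bool "//mycomment[contains(., ' mypattern1234 ')]")
    # every text child is tested on its own
    each_text=$(xpath_bool "//mycomment/text()[contains(., ' mypattern1234 ')]")
    echo "contains(text(), ...):    $first_only"
    echo "contains(., ...):         $string_value"
    echo "text()[contains(., ...)]: $each_text"
fi
rm -f "$sample"
```

With an XPath 1.0 engine the first check is expected to report false and the other two true, because the comment splits the content into two text nodes and the pattern sits in the second one.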
Apart from XML technologies, ripgrep speeds up the search by orders of magnitude:
#!/bin/zsh
# Usage: ./run.sh DATABASE SEARCH_PATTERN
DATABASE="$1"
SEARCH_PATTERN="$2"
if [ "${#SEARCH_PATTERN}" != 0 ] && [ "${#DATABASE}" != 0 ] ; then
    # Stage 1: fast scan of the whole file, keeping 5 lines of context per match
    RAWHIT=`rg -C 5 "$SEARCH_PATTERN" "$DATABASE"`
    if [ "${#RAWHIT}" != 0 ] ; then
        # Stage 2: keep only matches that lie inside a <comment> element
        HIT=`echo $RAWHIT | rg -c -U "<comment>.*$SEARCH_PATTERN.*</comment>"`
        if [ "${#HIT}" != 0 ] ; then
            echo "Pattern found"
        else
            echo "Pattern not found"
        fi
    else
        echo "Pattern not found"
    fi
else
    echo "Missing Search Pattern or Database"
fi
Run and timing on an i5 platform:
> time ./run.sh DB50C1000000P.xml Foo4711
Pattern found
./run.sh DB50C1000000P.xml Foo4711 0,23s user 0,58s system 98% cpu 0,828 total
> time ./run.sh DB50C1000000P.xml Foo4711a
Pattern not found
./run.sh DB50C1000000P.xml Foo4711a 0,23s user 0,59s system 98% cpu 0,829 total
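If only the yes/no classification from the question is needed, the scan can stop at the first match instead of collecting every hit. The following is a minimal sketch, assuming a plain substring test is acceptable: unlike the script above, it does not check that the match lies inside a comment element. The function name classify is illustrative. grep is used so that the sketch runs without ripgrep installed; rg -q stops at the first match in the same way.

```shell
#!/bin/sh
# classify FILE PATTERN: print 1 if PATTERN occurs anywhere in FILE,
# print nothing otherwise (mirrors the "hit" variable from the question).
classify() {
    # -q: no output, stop reading at the first match
    # -F: treat PATTERN as a fixed string, not as a regular expression
    if grep -q -F -e "$2" -- "$1"; then
        echo 1
    fi
}

# Run only when called with both arguments: ./classify.sh FILE PATTERN
if [ "$#" -eq 2 ]; then
    hit=$(classify "$1" "$2")
    echo "hit=$hit"
fi
```

Because the search ends at the first occurrence, a match near the start of a large file returns almost immediately, while a file without any match still has to be read to the end.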
Database download: Database
Comparison of technologies: XPath with XQuery or XSLT versus ripgrep:
- Solution 1 (M.H.):
//comment[contains(.,'searchpattern')]
- Solution 2 (M.S.):
//comment[contains(text(),'searchpattern')]
- Solution 3 (M.K.):
//text()[parent::comment][contains(.,'searchpattern')]
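Platform:
MacBook, i5, macOS v10.15.4, 16 GB RAM
XML database:
Size: 2.46 GB
Number of comment element nodes (each with a single text node in which the pattern is searched) = 5914102
Number of other element nodes (non-comment) = 11829597
XQuery v3.1, BaseX v9.3.2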
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=1 -bdatabase=DB50C1000000P -bpattern=Foo4711 run.xqy
Solution 1: Pattern found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 28,96s user 2,82s system 188% cpu 16,826 total
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=1 -bdatabase=DB50C1000000P -bpattern=Foo4711a run.xqy
Solution 1: Pattern not found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 42,62s user 4,05s system 140% cpu 33,233 total
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=2 -bdatabase=DB50C1000000P -bpattern=Foo4711 run.xqy
Solution 2: Pattern found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 29,25s user 2,70s system 196% cpu 16,271 total
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=2 -bdatabase=DB50C1000000P -bpattern=Foo4711a run.xqy
Solution 2: Pattern not found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 47,45s user 4,84s system 143% cpu 36,341 total
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=3 -bdatabase=DB50C1000000P -bpattern=Foo4711 run.xqy
Solution 3: Pattern found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 30,09s user 2,70s system 195% cpu 16,773 total
> time java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX -bsolution=3 -bdatabase=DB50C1000000P -bpattern=Foo4711a run.xqy
Solution 3: Pattern not found
java -Xmx10g -cp /Users/ms/Projekte/basex/BaseX.jar org.basex.BaseX run.xq 45,42s user 4,32s system 148% cpu 33,413 total
XSLT v3.0, SaxonEE v9-9-1-7J
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711 solution=1
Solution 1: Pattern found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 27,43s user 5,88s system 134% cpu 24,719 total
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711a solution=1
Solution 1: Pattern not found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 30,28s user 9,06s system 131% cpu 29,964 total
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711 solution=2
Solution 2: Pattern found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 27,55s user 4,44s system 158% cpu 20,202 total
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711a solution=2
Solution 2: Pattern not found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 34,47s user 5,24s system 177% cpu 22,324 total
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711 solution=3
Solution 3: Pattern found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 16,13s user 0,62s system 130% cpu 12,816 total
> time java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xml -xsl:run.xsl -o:out.xml database=DB50C1000000P.xml pattern=Foo4711a solution=3
Solution 3: Pattern not found
java -Xmx10g -jar /Users/ms/Projekte/SaxonEE9-9-1-7J/saxon9ee.jar -s:empty.xm 47,26s user 1,56s system 110% cpu 44,247 total
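Results
Good case: at least one comment contains the search pattern
Fastest (16,13s): Solution 3 (M.K.) on XSLT v3.0, SaxonEE v9-9-1-7J
Slowest (30,09s): Solution 3 (M.K.) on XQuery v3.1, BaseX v9.3.2
Bad case: no comment contains the search pattern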
Fastest (30,28s): Solution 1 (M.H.) on XSLT v3.0, SaxonEE v9-9-1-7J
Slowest (47,26s): Solution 3 (M.K.) on XSLT v3.0, SaxonEE v9-9-1-7J
Scripts and XML database
https://gitlab.com/ms152718212/xslxqyfindpattern
Solution using non-XML technology
ripgrep
> time rg -C 5 'Foo4711' DB50C1000000P.xml | rg -c -U '<comment>(?s:[^<]*)Foo4711(?s:[^<]*)</comment>'
4
rg -C 5 'Foo4711' DB50C1000000P.xml 0,27s user 0,72s system 90% cpu 1,094 total
rg -c -U '<comment>(?s:[^<]*)Foo4711(?s:[^<]*)</comment>' 0,01s user 0,01s system 1% cpu 1,092 total
> time rg -C 5 'Foo4711a' DB50C1000000P.xml | rg -c -U '<comment>(?s:[^<]*)Foo4711a(?s:[^<]*)</comment>'
rg -C 5 'Foo4711a' DB50C1000000P.xml 0,24s user 0,66s system 93% cpu 0,957 total
rg -c -U '<comment>(?s:[^<]*)Foo4711a(?s:[^<]*)</comment>' 0,01s user 0,01s system 1% cpu 0,957 total
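To put a number on the difference, the wall-clock totals from the listings can be divided directly. The following is a minimal sketch; the helper name ratio is illustrative, and the figures are single runs copied from the output above (the Java runs include JVM startup), so the result is only a rough indication.

```shell
#!/bin/sh
# ratio A B: hypothetical helper, prints A / B with one decimal place.
# The timings use a decimal comma, so it is converted to a point first.
ratio() {
    a=$(printf '%s' "$1" | tr ',' '.')
    b=$(printf '%s' "$2" | tr ',' '.')
    LC_ALL=C awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", a / b }'
}

# Pattern found: Solution 3 on SaxonEE (12,816 s total) vs. ripgrep (1,094 s total)
echo "found:     $(ratio 12,816 1,094)x"
# Pattern not found: Solution 3 on SaxonEE (44,247 s total) vs. ripgrep (0,957 s total)
echo "not found: $(ratio 44,247 0,957)x"
```

On these figures ripgrep comes out roughly 12 times faster when the pattern is present and roughly 46 times faster when it is absent, since it needs about one second in both cases while the XSLT run has to read the whole document when there is no match.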