从我的文件中"junk"的字符串报告中解析XML,我不知道如何找到它



我正在尝试用元素树解析一个XML字符串。这个字符串来自许多连接在一起的dict值。没有根节点,但第一次运行良好。

我第一次这么做的时候就奏效了:

for value in data.values():
myxml = ' '.join(value)
tree = ET.fromstring(myxml)

但对于同样的情况,只是另一本字典,它不起作用。我的代码很简单:

values = [x for x in dict_fasi.values()]
myxml_fasi = ' '.join(values)
tree2 = ET.fromstring(myxml_fasi)

我也像以前一样尝试了循环,但没有成功。错误显示:xml.etree.ElementTree.ParseError:文档元素后的垃圾:第8行第20列

第8行应为:

</new_line> <new_line>

XML字符串是:

<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="297.284,540.828,300.188,553.310" colourspace="DeviceGray" ncolour="0" size="12.482">della quale non conosce che una parte;] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="322.455,540.839,328.251,553.566" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
<text font="NUMPTY+ImprintMTnum" bbox="331.206,545.345,334.683,552.834" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="177.602,528.028,180.850,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che nonconosce ancora appieno;</text>
<text font="NUMPTY+ImprintMTnum" bbox="189.430,532.545,192.908,540.034" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="203.879,528.028,208.975,540.510" colourspace="DeviceGray" ncolour="0" size="12.482">che</text>
</new_line> <new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="315.109,462.272,319.863,472.957" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
<text font="NUMPTY+ImprintMTnum" bbox="368.916,461.828,372.743,474.310" colourspace="DeviceGray" ncolour="0" size="12.482">avederci]</text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="86.577,449.039,92.373,461.766" colourspace="DeviceGray" ncolour="0" size="12.727">sps.a</text>
<text font="NUMPTY+ImprintMTnum" bbox="167.611,449.028,172.707,461.510" colourspace="DeviceGray" ncolour="0" size="12.482">dove io andava a</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="68.031,421.672,72.786,432.357" colourspace="DeviceGray" ncolour="0" size="10.685">5</text>
<text font="NUMPTY+ImprintMTnum" bbox="137.296,421.228,140.200,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">tante libertà] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="161.868,421.239,167.664,433.966" colourspace="DeviceGray" ncolour="0" size="12.727">prima</text>
<text font="NUMPTY+ImprintMTnum" bbox="170.784,425.745,174.262,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">1</text>
<text font="NUMPTY+ImprintMTnum" bbox="174.297,421.228,183.920,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
<text font="MUVAOR+Symbol" bbox="194.367,421.612,199.376,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;&gt;</text>
<text font="NUMPTY+ImprintMTnum" bbox="208.349,425.745,211.827,433.234" colourspace="DeviceGray" ncolour="0" size="7.489">2</text>
<text font="NUMPTY+ImprintMTnum" bbox="244.601,421.228,250.976,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">certe lib</text>
<text font="MUVAOR+Symbol" bbox="250.901,421.612,255.910,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;</text>
<text font="NUMPTY+ImprintMTnum" bbox="269.331,421.228,274.426,433.710" colourspace="DeviceGray" ncolour="0" size="12.482">ertà</text>
<text font="MUVAOR+Symbol" bbox="274.363,421.612,279.373,431.672" colourspace="DeviceGray" ncolour="0" size="10.060">&gt;</text>
</new_line> <new_line>

相反,工作的第一个XML字符串是这样的:

<new_line>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="234.782,118.872,239.536,129.558" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
<text font="NUMPTY+ImprintMTnum" bbox="360.280,118.428,363.184,130.911" colourspace="DeviceGray" ncolour="0" size="12.482">pazienza, e la prudenza.] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="369.339,118.440,375.135,131.167" colourspace="DeviceGray" ncolour="0" size="12.727">da</text>
<text font="NUMPTY+ImprintMTnum" bbox="113.588,105.629,118.684,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">pa-zienza</text>
<text font="MUVAOR+Symbol" bbox="120.415,105.707,124.422,117.543" colourspace="DeviceGray" ncolour="0" size="11.835">=</text>
</new_line>
<new_line>
<text font="NUMPTY+ImprintMTnum" bbox="194.095,105.629,196.999,118.111" colourspace="DeviceGray" ncolour="0" size="12.482">Cristoforo] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="214.031,105.640,219.827,118.367" colourspace="DeviceGray" ncolour="0" size="12.727">sts.a</text>
<text font="NUMPTY+ImprintMTnum" bbox="241.600,81.508,247.396,93.991" colourspace="DeviceGray" ncolour="0" size="12.482">Galdino 72</text>
<text font="SZWUPJ+ImprintExpertMT" bbox="272.785,614.422,276.490,625.380" colourspace="DeviceGray" ncolour="0" size="10.958">  </text>
<text font="NUMPTY+ImprintMTnum" bbox="53.923,592.408,58.102,602.646" colourspace="DeviceGray" ncolour="0" size="10.238">34c</text>
<text font="QKWQNQ+ImprintMTnum-Bold" bbox="72.640,592.472,77.394,603.157" colourspace="DeviceGray" ncolour="0" size="10.685">80</text>
<text font="NUMPTY+ImprintMTnum" bbox="187.701,592.028,190.605,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">troverà … immaginare] </text>
<text font="PYNIYO+ImprintMTnum-Italic" bbox="201.265,592.039,204.169,604.766" colourspace="DeviceGray" ncolour="0" size="12.727">da </text>
<text font="NUMPTY+ImprintMTnum" bbox="305.701,592.028,310.796,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">qualche rimedio inaspe</text>
<text font="MUVAOR+Symbol" bbox="310.691,592.412,315.701,602.472" colourspace="DeviceGray" ncolour="0" size="10.060">&lt;</text>
<text font="NUMPTY+ImprintMTnum" bbox="331.518,592.028,337.314,604.510" colourspace="DeviceGray" ncolour="0" size="12.482">ttato</text>
<text font="MUVAOR+Symbol" bbox="337.154,592.412,342.163,602.472" colourspace="DeviceGray" ncolour="0" size="10.060">&gt;</text>
</new_line>

也许这是new_line标签打开和关闭的问题,但我不知道如何解决

错误消息中的术语"垃圾"似乎是一个相当不公平的价值判断;但这意味着解析器希望看到一个顶级元素,当它到达该元素(以及任何尾随注释或PI(的末尾时,它希望看到文件的末尾。如果有另一个元素开始标记,那么它就不是一个格式良好的XML文档。

你说你知道没有根节点,但你似乎没有意识到这会使文档格式不正确。你说它第一次起作用:嗯,它不应该起作用。

相关内容

最新更新