How to edit the text between two string delimiters in Python when the string spans multiple lines

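I am converting an XML document into a .ckl document. The two file formats are similar, but the conversion is not straightforward. I have most of it working, but I am stuck on one part.

Before parsing the XML with ElementTree, I have to convert some &lt; and &gt; into < and >, because the original XML has errors that must be corrected before it will parse properly. What I did not realize is that in some groups I need to leave &lt; and &gt; alone, because the .ckl reader program displays that text as < and >.

Basically, I over-corrected to make parsing possible, but some of the changes need to be reverted when they fall inside a <fixtext> group.

For the initial correction, I copy the whole XML file into a variable as one big string and run data.replace('&lt;', '<'). This works and replaces all the instances I want, but it also converts the cases where I need &lt; left in place.

After that, I need to change the few cases inside <fixtext> groups back before parsing, otherwise the result is a mess.

TL;DR: I need to replace < and > with &lt; and &gt; between the delimiters <fixtext *tags here*> and </fixtext> in a multi-line string, where the number of lines varies.

Any help would be appreciated. If you need more information, let me know and I will gladly answer.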

Example of where the original XML is off:

<description>&lt;VulnDiscussion&gt;

Here, VulnDiscussion should be a new tag.

The fixtext as it starts out:

<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives "Require additional authentication at startup" to "Enabled" with "Configure TPM
Startup PIN:" set to "Require startup PIN with TPM" or with "Configure TPM startup key and PIN:" set to
"Require startup key and PIN with TPM".
</fixtext>

Using a regular expression

import re
import html  # the html module (Python 3.2+) provides html.escape for escaping reserved markup characters

# Example text with &lt; and &gt; both inside and outside the <fixtext> tags
html_doc = '''&gt;&gt;&lt;&lt;&lt;&gt;&gt;&lt;&lt;&lt;blahblah<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives "Require additional authentication at startup" to "Enabled" with "Configure TPM
Startup PIN:" set to "Require startup PIN with TPM" or with "Configure TPM startup key and PIN:" set to
"Require startup key and PIN with TPM".
</fixtext>&gt;&gt;&lt;&lt;&lt;blahblah'''

# Apply the blanket substitution the OP runs on the whole document (and later wants to partly reverse)
html_doc = html_doc.replace('&gt;', '>').replace('&lt;', '<')

# Pattern capturing the opening tag, the text between the tags, and the closing tag.
# The DOTALL flag lets . match newlines, so multi-line text is handled.
p = re.compile(r"(?P<TAG_START><fixtext[^>]*>)(?P<TEXT>.*?)(?P<TAG_END></fixtext>)", flags=re.DOTALL)

# Escape only the characters between the tags; the tags themselves are left untouched
corrected = p.sub(
    lambda m: m.group("TAG_START") + html.escape(m.group("TEXT")) + m.group("TAG_END"),
    html_doc)
print(corrected)

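Note that in corrected, < and > have been escaped to &lt; and &gt; only between the tags. Because html.escape defaults to quote=True, the double quotes between the tags are escaped to &quot; as well.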

>><<<>><<<blahblah<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives &quot;Require additional authentication at startup&quot; to &quot;Enabled&quot; with &quot;Configure TPM
Startup PIN:&quot; set to &quot;Require startup PIN with TPM&quot; or with &quot;Configure TPM startup key and PIN:&quot; set to
&quot;Require startup key and PIN with TPM&quot;.
</fixtext>>><<<blahblah
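
If the quotes should stay as they are, a variation on the same idea is to replace only < and > by hand instead of calling html.escape. This also avoids double-escaping any &amp; already present in the text, since html.escape rewrites every & it sees. The following is a minimal sketch under those assumptions; the names FIXTEXT_RE and reescape_fixtext are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical pattern name: opening <fixtext ...> tag, body, closing tag.
# DOTALL lets . match newlines so the body may span any number of lines.
FIXTEXT_RE = re.compile(r"(<fixtext[^>]*>)(.*?)(</fixtext>)", flags=re.DOTALL)


def reescape_fixtext(data: str) -> str:
    """Re-escape only < and > inside <fixtext> bodies; leave all other text alone."""
    def fix(match: re.Match) -> str:
        # Only the body (group 2) is modified; the tags are passed through unchanged.
        body = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + body + match.group(3)

    return FIXTEXT_RE.sub(fix, data)


sample = '<a><fixtext fixref="F-1">Computer Configuration >> "Enabled"\nnext line > end\n</fixtext><b>'
print(reescape_fixtext(sample))
```

The sketch assumes that no literal </fixtext> appears inside a fixtext body and that fixtext elements are not nested; the non-greedy .*? stops at the first closing tag it finds.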
