How do I loop over a file in Python and delete parts of it?



I have a data structure like the following:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>

Given an input file that lists several segment names, for example

1
3

it should delete the segments that have those names. For example, given 1 and 3, the segments named 1 and 3 are deleted:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
</recording>
</corpus>

The code so far:

with open('file.txt', 'r') as inputFile:
    w_file = inputFile.readlines()
    w_file = w_file.strip('\n')
with open('to_delete_nums.txt', 'r') as File:
    d_file = deleteFile.readlines()
    d_file = d_file.strip('\n')
for line in w_file:
    if line.contains("<segment name"):
        for d in d_file:
            //if segment name is equal to d then delete that segment.

How can I do this? I also think having 2 of them is probably unnecessary, right?

Method 1 (with a module):

As @iain-shelvington said, with an XML parsing/manipulation library you can do this simply and quickly.

Try lxml with XPath:

import lxml.etree as et
xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>"""
tree = et.XML(xml.encode())
find_segments = tree.xpath("*//segment[@name='1' or @name='2']") # you can add more segments here
for each_segment in find_segments:
    each_segment.getparent().remove(each_segment)
clean_content = str(et.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(clean_content)

Thanks to @cédric-julien, @Sheena, @xyz, @josh-allemon and these questions:

  1. How to remove an element in lxml
  2. Using an OR condition in XPath to identify the same element
  3. lxml.etree.XML ValueError for Unicode string
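If installing lxml is not an option, roughly the same removal can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not the approach from the answer above: ElementTree elements have no getparent(), so the sketch visits every element and removes matching children from it. The helper name remove_segments is hypothetical.

```python
import xml.etree.ElementTree as ET

xml = """<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>"""


def remove_segments(root, names):
    """Remove every <segment> whose name attribute is in `names` (hypothetical helper)."""
    names = set(names)
    # Materialise the traversal first so the tree is not modified while iterating.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "segment" and child.get("name") in names:
                parent.remove(child)
    return root


root = remove_segments(ET.fromstring(xml), ["1", "3"])
remaining = [seg.get("name") for seg in root.iter("segment")]
print(remaining)  # ['2']
print(ET.tostring(root, encoding="unicode"))
```

Passing the names as a list or set keeps the lookup independent of how many segments need to be removed, so no XPath string has to be assembled.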

Method 2 (hard-coded):

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
<segment name="3" start="4" end="6">
<orth>some text 3</orth>
</segment>
</recording>
</corpus>"""
lines = []
toggle = True
for each_line in xml.splitlines():
    if each_line.strip().startswith("<segment") and ('name="1"' in each_line or 'name="2"' in each_line):
        toggle = False  # entering a segment that should be dropped
    elif each_line.strip().startswith("</segment>") and toggle is False:
        toggle = True  # the dropped segment ends here
    elif toggle:
        lines.append(each_line)
new_xml = "\n".join(lines)
print(new_xml)
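The line-based toggle above can be generalised from two hard-coded names to an arbitrary set. A minimal sketch, assuming every opening <segment ...> tag and every closing </segment> tag sits on its own line as in the sample; the function name drop_segments and the regular expression are illustrative and not part of the original answer.

```python
import re

# Matches an opening <segment> tag and captures the value of its name attribute.
SEGMENT_OPEN = re.compile(r'<segment\b[^>]*\bname="([^"]*)"')


def drop_segments(xml_text, names):
    """Return xml_text without the segments whose name is in `names`."""
    names = set(names)
    kept = []
    skipping = False
    for line in xml_text.splitlines():
        stripped = line.strip()
        match = SEGMENT_OPEN.match(stripped)
        if match and match.group(1) in names:
            skipping = True   # start of a segment to drop
        elif skipping and stripped.startswith("</segment>"):
            skipping = False  # end of the dropped segment
        elif not skipping:
            kept.append(line)
    return "\n".join(kept)


sample = """<recording audio="audio.wav" name="first audio">
<segment name="1" start="0" end="2">
<orth>some text 1</orth>
</segment>
<segment name="2" start="2" end="4">
<orth>some text 2</orth>
</segment>
</recording>"""

result = drop_segments(sample, {"1"})
print(result)
```

Like the hard-coded version, this breaks as soon as a tag spans several lines or two tags share one line, which is why the parser-based method is usually the safer choice.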

If you want to read the names from a file, try this:

from lxml import etree
with open("xml.txt", "r") as xml_file:
    xml_data = xml_file.read()
with open('nums.txt', 'r') as file:
    list_of_names = file.read().split("\n")
new_xml = xml_data
for each_name in list_of_names:
    # Re-parse the current document, drop the matching segments, serialise again.
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(new_xml)

Shorter:

from lxml import etree
with open("xml.txt", "r") as xml_file:
    tree = etree.XML(xml_file.read().encode())
with open('nums.txt', 'r') as file:
    list_of_names = list(set(file.read().split("\n")))
# Build a single XPath predicate that ORs all names together.
xpath = "*//segment[{}]".format(" or ".join(["@name='{}'".format(each_name) for each_name in list_of_names]))
print(xpath)
for each_segment in tree.xpath(xpath):
    each_segment.getparent().remove(each_segment)
new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(new_xml)
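One detail in the file-based versions: splitting the whole file on newlines leaves an empty entry when the file ends with a newline, and any stray whitespace stays attached to the names. A hedged sketch of how the names could be cleaned before the XPath expression is built; read_names and build_xpath are hypothetical helpers, and io.StringIO stands in for nums.txt so the example is self-contained.

```python
import io


def read_names(handle):
    """Return the unique, non-empty names from a file-like object, in file order."""
    seen = []
    for line in handle:
        name = line.strip()
        if name and name not in seen:
            seen.append(name)
    return seen


def build_xpath(names):
    """Build the same OR-joined predicate as the shorter version above."""
    conditions = " or ".join("@name='{}'".format(name) for name in names)
    return "*//segment[{}]".format(conditions)


names = read_names(io.StringIO("1\n3\n\n3\n"))
print(names)               # ['1', '3']
print(build_xpath(names))  # *//segment[@name='1' or @name='3']
```

An empty list would produce an empty predicate, which is not a valid XPath expression, so it is worth checking that names is non-empty before querying. Names containing a single quote would also need escaping, which this sketch does not attempt.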
