XML: remove unwanted tags but keep the text content



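I am cleaning up a corpus that has too many tags. To do that, I want to filter out (remove) the useless tags but keep their text content. I am new to XML, and none of the code I have tried works. The corpus looks like this: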

<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you </sentence>
<sentence tag1="ff" tag2= "e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2= "bbb"> Great </sentence>
<sentence tag1="f" tag2= "dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>

The ideal result would be:

<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2= "dd"> How are you today </sentence>
</dialogue>
<dialogue speaker="A">
Great How about you
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2= "dd"> me too </sentence>
</dialogue>
</corpus>

The first code I tried is this, but it keeps giving me an error at strip_tags():

f = ET.parse("file.xml")
root = f.getroot()
def filter_by(f, tag_list):
    for elem in root.iter('dialogue'):
        for start in elem.iter('sentence'):
            print(sentence.attrib)
            if tag_list in root.findall('.//sentence[@tag1]'):
                pass
            else:
                etree.strip_tags(f, 'sentence')
    return f
filter_by(f, ["a"])
f.write("output.xml")

Since I need to keep several tags, the other option I tried is this, but it still gives me an error in the if statement:

f = ET.parse("file.xml")
root = f.getroot()
tags_want = ["a", "cc"]
for child in root.iter('sentence'):
    attrib = child.get("tag1")
    if attrib not in tags_want:
        etree.strip_tags(f, 'sentence')
f.write("output.xml")

Can anyone help me?

I would do it in one of the two ways below. First, using ElementTree and XPath, as you did:

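import xml.etree.ElementTree as ET

root = ET.parse('file.xml').getroot()
for dia in root.findall('.//dialogue'):
    sentences = dia.findall('./sentence')
    if len(sentences) > 1:
        # Merge the text of every <sentence> into the first one
        new_text = "".join(s.text or "" for s in sentences).strip()
        sentences[0].text = new_text
        # Remove the now-redundant siblings; clear() would only empty them
        # and leave <sentence /> elements behind
        for to_delete in sentences[1:]:
            dia.remove(to_delete)
print(ET.tostring(root).decode())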

Second, although it probably makes little difference for the sample XML, I would use lxml rather than ElementTree, because lxml has better XPath support:

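from lxml import etree

root = etree.parse('file.xml')
for dia in root.xpath('//dialogue'):
    # count() returns a float, so the comparison with 1 works as expected
    if dia.xpath('count(./sentence)') > 1:
        # Collect all text nodes under the <sentence> children and merge them
        new_text = "".join(dia.xpath('.//sentence//text()')).strip()
        dia.xpath('.//sentence')[0].text = new_text
        # Drop every <sentence> after the first one
        for to_delete in dia.xpath('.//sentence[position()>1]'):
            to_delete.getparent().remove(to_delete)
print(etree.tostring(root).decode())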

In either case, the output should be:

<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd">How are you  today</sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb">Great  How about you</sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2="dd"> me too </sentence>
</dialogue>
</corpus>
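
Both snippets above merge the sentences of a dialogue into one. If the goal is instead to drop only the sentence tags whose tag1 value is not in a list of wanted values, while keeping their text inside the parent dialogue, something like the following may be closer. It is only a minimal sketch using the standard-library ElementTree: the helper name unwrap_unwanted and its parameters are made up for illustration, the wanted values {"a", "cc"} are taken from the question, and it assumes that sentence elements contain text only (no child elements).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<corpus>
<dialogue speaker="A">
<sentence tag1="a" tag2="b"> Hello </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="cc" tag2="dd"> How are you </sentence>
<sentence tag1="ff" tag2="e"> today </sentence>
</dialogue>
<dialogue speaker="A">
<sentence tag1="d" tag2="bbb"> Great </sentence>
<sentence tag1="f" tag2="dd"> How about you </sentence>
</dialogue>
<dialogue speaker="B">
<sentence tag1="a" tag2="dd"> me too </sentence>
</dialogue>
</corpus>"""


def unwrap_unwanted(root, wanted, tag="sentence", attr="tag1"):
    """Hypothetical helper: remove <tag> elements whose `attr` value is not
    in `wanted`, but keep their text in the parent element.

    Assumes the removed elements hold text only (no child elements).
    """
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(root.iter()):
        previous = None  # last child of this parent that was kept
        for child in list(parent):
            if child.tag == tag and child.get(attr) not in wanted:
                kept = (child.text or "") + (child.tail or "")
                if previous is None:
                    # No kept sibling before it: text goes to the parent
                    parent.text = (parent.text or "") + kept
                else:
                    # Otherwise it follows the previous kept sibling
                    previous.tail = (previous.tail or "") + kept
                parent.remove(child)
            else:
                previous = child
    return root


root = ET.fromstring(SAMPLE)
unwrap_unwanted(root, {"a", "cc"})
result = ET.tostring(root, encoding="unicode")
print(result)
```

With this approach a sentence with a wanted tag1 stays as it is, and the text of an unwanted one ends up directly inside its dialogue, after whatever precedes it. With lxml, etree.strip_tags() does a similar unwrapping, but it applies to every element with the given tag name in the tree, not to a single element, which is one reason the loops in the question do not filter by attribute.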
