How to reference a parent element and remove it from RSS XML with lxml in Python



I've been hacking away at this for a while. I have an RSS feed in the form of an XML file. Simplified, it looks like this:

<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>

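My goal is to check whether the second description tag contains certain strings. If it does contain such a string, I want to remove the whole item. Currently my code looks like this: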

import lxml.etree

doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')

for desc in found:
    if "FORBIDDENSTRING" in desc.text:
        # desc.getparent() is the <item>; this removes only the <description>
        desc.getparent().remove(desc)

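This removes only the second description tag, which makes sense, but I want the whole item gone. I don't know how to get hold of the "item" element when all I have is the "desc" reference.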

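I've tried googling as well as searching here, but the cases I find only want to remove the tag itself, as I'm doing now. Oddly, I haven't stumbled on example code that gets rid of the whole parent object. Any pointers to documentation, tutorials, or other help are very welcome.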

I'm a big fan of XSLT, but another option is to select the item instead of the description (select the element you want to delete, not its child).

Also, if you use xpath(), you can put the check for the forbidden string directly in the XPath predicate.

Example...

from lxml import etree
testString = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
"""
forbidden_string = "I want to get rid of the whole item"
parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))
for item in found:
    item.getparent().remove(item)
print(etree.tostring(doc, encoding="unicode", pretty_print=True))

This prints...

<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description/>
<item>...</item>
<item>...</item>
<item>...</item>
<item>...</item>
</channel>
</rss>
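
To answer the literal question of how to reach the item from a desc reference: desc.getparent() is the item, and calling getparent() on that item gives the channel it has to be removed from. The sketch below shows that route next to a variant of the XPath approach that binds the search text as an XPath variable, which lxml's xpath() accepts as a keyword argument, so that quotes in the string cannot break the expression. The sample feed, the function names, and the variable name needle are made up for illustration.

```python
from lxml import etree

# Hypothetical sample feed, trimmed from the one in the question.
SAMPLE = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<item>
<title>Keep me</title>
<description><![CDATA[<p>Harmless text.</p>]]></description>
</item>
<item>
<title>Drop me</title>
<description><![CDATA[<a href="https://example.com">View Example</a>]]></description>
<description><![CDATA[<p>Contains FORBIDDENSTRING somewhere.</p>]]></description>
</item>
</channel>
</rss>
"""


def remove_items_via_getparent(doc, forbidden):
    """Start from each description, climb to its item, and remove the item."""
    for desc in doc.findall("channel/item/description"):
        item = desc.getparent()      # the enclosing <item>
        channel = item.getparent()   # None if this item was already removed
        # desc.text is None for an empty element, so fall back to "".
        if channel is not None and forbidden in (desc.text or ""):
            channel.remove(item)


def remove_items_via_xpath_variable(doc, forbidden):
    """Select the items directly; $needle is bound by keyword argument."""
    query = ".//channel/item[description[contains(., $needle)]]"
    for item in doc.xpath(query, needle=forbidden):
        item.getparent().remove(item)


for remover in (remove_items_via_getparent, remove_items_via_xpath_variable):
    doc = etree.fromstring(SAMPLE)
    remover(doc, "FORBIDDENSTRING")
    print([t.text for t in doc.findall("channel/item/title")])
```

Both calls should leave only the "Keep me" item. findall() returns a list, so removing elements while looping over it is safe here.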

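Consider XSLT, a special-purpose language designed to transform XML files, for example by conditionally removing nodes by value. Python's lxml can run XSLT 1.0 scripts and can even pass parameters from the Python script into the XSLT (not unlike passing parameters in SQL!). This way you avoid any for loops, if logic, or rebuilding the tree at the application layer.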

XSLT (save as an .xsl file, which is a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" cdata-section-elements="description"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="search_string" />       
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- KEEP ONLY item NODES THAT DO NOT CONTAIN $search_string -->
<xsl:template match="channel">
<xsl:copy>
<xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

Python (for demonstration, the code below runs two searches against the posted sample)

import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>
#       <guid/>
#       <pubDate/>
#       <author/>
#       <title>Title of the item</title>
#       <link href="https://example.com" rel="alternate" type="text/html"/>
#       <description><![CDATA[<a href="https://example.com" target="_blank" rel="noopener noreferrer">View Example</a>]]></description>
#       <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
#     </item>
#     <item>...</item>
#   </channel>
# </rss>
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)
print(result)    
# <?xml version="1.0"?>
# <rss version="2.0">
#   <channel>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#     <item>...</item>
#   </channel>
# </rss>
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
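
One thing to note about the stylesheet above: its channel template applies templates only to item, so the channel's own title, link, and description are absent from the output shown. If those should be kept, one possible variant (a sketch, not the answerer's code) leaves channel to the identity transform and puts the condition inside an item template instead. XSLT_SOURCE and FEED are made-up inline stand-ins for the .xsl and Input.xml files.

```python
from lxml import etree

# Assumed variant of the stylesheet: the filter lives in an item template,
# so everything else in <channel> falls through to the identity transform.
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" cdata-section-elements="description"/>
  <xsl:strip-space elements="*"/>
  <xsl:param name="search_string"/>
  <!-- identity transform: copy everything by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- copy an item only if its second description lacks $search_string -->
  <xsl:template match="item">
    <xsl:if test="not(contains(description[2], $search_string))">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical miniature feed standing in for Input.xml.
FEED = b"""\
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<item>
<title>Keep me</title>
<description>first</description>
<description>harmless text</description>
</item>
<item>
<title>Drop me</title>
<description>first</description>
<description>a bunch of text to filter on</description>
</item>
</channel>
</rss>
"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
result = transform(
    etree.fromstring(FEED),
    search_string=etree.XSLT.strparam("bunch of text"),
)
root = result.getroot()
print(root.findtext("channel/title"))
print([t.text for t in root.findall("channel/item/title")])
```

With this variant the channel title should survive the transformation while the matching item is dropped.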
