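I am reading an XML file and parsing it. I want to remove the width attribute from every img element that is part of the XML document.
How can I parse this HTML, search for the image tags, update them, and return the updated HTML?
Below is the sample XML. I want to remove the img attribute inside the description tag.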
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
<title></title>
<atom:link href="" rel="self" type="application/rss+xml" />
<link></link>
<description>Award-Winning Impact Media - Alt Protein &amp;
Sustainability Breaking News</description>
<lastBuildDate>Sun, 24 Oct 2021 19:27:36 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>
hourly </sy:updatePeriod>
<sy:updateFrequency>
1 </sy:updateFrequency>
<item>
<title>Vegan .</title>
<link>https://www.google.com</link>
<dc:creator><![CDATA[Sally Ho]]></dc:creator>
<pubDate>Mon, 25 Oct 2021 00:00:00 +0000</pubDate>
<category><![CDATA[Alt Protein]]></category>
<category><![CDATA[Seafood]]></category>
<category><![CDATA[Vegan]]></category>
<category><![CDATA[alternative seafood]]></category>
<category><![CDATA[plant based tuna]]></category>
<category><![CDATA[vegan seafood]]></category>
<category><![CDATA[vegan tuna]]></category>
<guid isPermaLink="false">https://www.google.com/?p=55401</guid>
<description><![CDATA[<div style="margin-
bottom:20px;">
<img width="1024" height="768" src="" class="attachment-post-
thumbnail size-post-thumbnail wp-post-image" alt="" srcset=""
sizes="
(max-width: 1024px) 100vw, 1024px" /></div>
<p><span class="rt-reading-time" style="display: block;"><span
class="rt-label rt-prefix"></span> <span class="rt-
time">4</span>
<span class="rt-label rt-postfix"> Mins Read</span></span>
</p>
<p>The post <a rel="nofollow"
href="https://www.greenqueen.com.hk/vegan-tuna-brands/"> <a
rel="nofollow" href="">Green</a>.</p>
]]></description>
</item>
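getElementsByTagName will not work in this case, because the HTML is contained in CDATA (character data) sections. The content of these sections is not parsed by the XML parser and is not treated as markup.
See the W3C references:
- https://www.w3.org/TR/REC-xml/#sec-cdata-sect
- https://www.w3.org/TR/REC-xml/#dt-chardata
If the sections contain HTML, the easiest way to manipulate them in vanilla JavaScript is to read the content from the nodeValue property and set it as the innerHTML property of a temporary element appended to a new DocumentFragment. This lets you search for elements with selectors, using the same methods you know from document (see the removeWidthFromCDATA function in the snippet below).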
const xmlStr = `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
>
<channel>
<title></title>
<atom:link href="" rel="self" type="application/rss+xml" />
<link></link>
<description>Award-Winning Impact Media - Alt Protein &amp;
Sustainability Breaking News</description>
<lastBuildDate>Sun, 24 Oct 2021 19:27:36 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>
hourly </sy:updatePeriod>
<sy:updateFrequency>
1 </sy:updateFrequency>
<item>
<title>Vegan .</title>
<link>https://www.google.com</link>
<dc:creator><![CDATA[Sally Ho]]></dc:creator>
<pubDate>Mon, 25 Oct 2021 00:00:00 +0000</pubDate>
<category><![CDATA[Alt Protein]]></category>
<category><![CDATA[Seafood]]></category>
<category><![CDATA[Vegan]]></category>
<category><![CDATA[alternative seafood]]></category>
<category><![CDATA[plant based tuna]]></category>
<category><![CDATA[vegan seafood]]></category>
<category><![CDATA[vegan tuna]]></category>
<guid isPermaLink="false">https://www.google.com/?p=55401</guid>
<description><![CDATA[<div style="margin-
bottom:20px;">
<img width="1024" height="768" src="" class="attachment-post-
thumbnail size-post-thumbnail wp-post-image" alt="" srcset=""
sizes="
(max-width: 1024px) 100vw, 1024px" /></div>
<p><span class="rt-reading-time" style="display: block;"><span
class="rt-label rt-prefix"></span> <span class="rt-
time">4</span>
<span class="rt-label rt-postfix"> Mins Read</span></span>
</p>
<p>The post <a rel="nofollow"
href="https://www.greenqueen.com.hk/vegan-tuna-brands/"> <a
rel="nofollow" href="">Green</a>.</p>
]]></description>
</item>
</channel>
</rss>`;
// Parse the HTML held in a CDATA section, strip width from every img,
// and return the updated HTML string.
function removeWidthFromCDATA(cdataString) {
  const fragment = document.createDocumentFragment();
  const html = document.createElement("div");
  html.innerHTML = cdataString;
  fragment.appendChild(html);
  const images = fragment.querySelectorAll("img");
  for (const image of images) {
    if (image.hasAttribute("width")) {
      image.removeAttribute("width");
    }
  }
  return html.innerHTML;
}

// Parse the RSS string, clean every description that starts with a CDATA
// section, and serialize the document back to a string.
function cleanRSS(xmlString) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(xmlString, "application/xml");
  // A failed parse yields a document containing a parsererror element.
  const errorNode = doc.querySelector("parsererror");
  if (errorNode) {
    console.error("Error parsing xml string.");
    return false;
  }
  const descs = doc.querySelectorAll("description");
  for (const desc of descs) {
    const content = desc.firstChild;
    // Skip descriptions whose first child is not a CDATA section.
    if (!content || content.nodeType !== Node.CDATA_SECTION_NODE) {
      continue;
    }
    content.nodeValue = removeWidthFromCDATA(content.nodeValue);
  }
  const serializer = new XMLSerializer();
  return serializer.serializeToString(doc);
}

console.log(cleanRSS(xmlStr));
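If no DOM is available (for example in a plain Node.js script, which is my assumption and not something the question states), the same idea can be roughly approximated with regular expressions. This is only a sketch, not a real HTML parser: the helper names stripImgWidth and stripImgWidthInCdata are made up for illustration, and it assumes that attribute values inside img tags contain neither ">" nor text that looks like ` width=`.

```javascript
// Hypothetical helpers; a regex-based sketch, not a substitute for a real parser.

// Matches one CDATA section and captures its raw text content.
const CDATA_RE = /<!\[CDATA\[([\s\S]*?)\]\]>/g;
// Matches an <img ...> tag (assumes no ">" inside attribute values).
const IMG_TAG_RE = /<img\b[^>]*>/gi;
// Matches a standalone width attribute: leading whitespace, "width", "=",
// then a double-quoted, single-quoted, or bare value.
const WIDTH_ATTR_RE = /\s+width\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+)/gi;

// Remove the width attribute from every img tag in an HTML string.
function stripImgWidth(html) {
  return html.replace(IMG_TAG_RE, (tag) => tag.replace(WIDTH_ATTR_RE, ""));
}

// Apply stripImgWidth to the content of every CDATA section in an XML string.
function stripImgWidthInCdata(xml) {
  return xml.replace(
    CDATA_RE,
    (_match, body) => `<![CDATA[${stripImgWidth(body)}]]>`
  );
}

// Shortened sample modeled on the description element from the question.
const sample =
  '<description><![CDATA[<div><img width="1024" height="768" src="" ' +
  'sizes="(max-width: 1024px) 100vw, 1024px" /></div>]]></description>';

console.log(stripImgWidthInCdata(sample));
```

Because the pattern requires whitespace directly before width and an equals sign after it, the max-width inside the sizes value is left alone. In a browser, the DOM-based version above remains the more robust choice.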
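As a side note: pay attention to the template used to generate the RSS feed. If you have control over it, fix the missing closing of the <a> tag. Also watch the whitespace before the line breaks, because it breaks the names of CSS classes and attributes applied to elements inside the CDATA sections.
If you can provide sample input and output, we can extend this snippet and make it runnable.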
// `xml` is the xml document
const imgs = xml.getElementsByTagName("img");
for (const img of imgs) {
  if (img.hasAttribute("width")) {
    img.removeAttribute("width");
  }
}
// output xml/html string from `xml`