Hello, this is the encoded XML I extracted from an email:
<?xml version="1.0" encoding="utf-8"?>
<message_root>
<message>
<to>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</to>
<from>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</from>
<return-path>abc</return-path>
<date>abc</date>
<subject>abc</subject>
<mime-version>1.0</mime-version>
<message-id>&lt;abc&gt;</message-id>
<body_html>&lt;html dir="ltr"&gt;
&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8"&gt;
&lt;style type="text/css" id="owaParaStyle"&gt;&lt;/style&gt;
&lt;/head&gt;
&lt;body fpstyle="1" ocsi="0"&gt;
&lt;div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;"&gt;Hello alfjskfslfkjsjsf
&lt;div&gt;Attr A: Hello my name is&lt;/div&gt;
&lt;div&gt;Attr B: ABCXYZ&lt;/div&gt;
&lt;div&gt;Attr C: 5&lt;/div&gt;
&lt;div&gt;Attr D: Mr.ABC&lt;/div&gt;
&lt;div&gt;Thank you so much&lt;/div&gt;
&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</body_html>
<body_text />
</message>
</message_root>
The XML I want:
<?xml version="1.0" encoding="utf-8"?>
<message_root>
<message>
<to>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</to>
<from>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</from>
<return-path>abc</return-path>
<date>abc</date>
<subject>abc</subject>
<mime-version>1.0</mime-version>
<message-id>abc</message-id>
<body_html>
<AttrA> Hello my name is </AttrA>
<AttrB> ABCXYZ </AttrB>
<AttrC> 5 </AttrC>
<AttrD> Mr.ABC </AttrD>
</body_html>
<body_text />
</message>
</message_root>
I am using this XSLT 3.0 stylesheet, with parse-xml, to decode the body_html part:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" />
<xsl:template match="message_root">
<message_root>
<xsl:apply-templates select="message" />
</message_root>
</xsl:template>
<xsl:template match="message">
<message>
<xsl:apply-templates select="body_text" />
<datasource>Inbox</datasource>
<source>Test</source>
<xsl:copy-of select="subject" />
<xsl:copy-of select="date" />
<xsl:copy-of select="from" />
<xsl:copy-of select="to" />
<xsl:copy-of select="parse-xml(body_html)" />
<messageid>
<xsl:value-of select="substring-before(translate(translate(message-id,'&lt;',''),'&gt;',''),'@')" />
</messageid>
<xsl:variable name="div" select="html/body/div/div" />
<AttrA>
<xsl:value-of select="substring-after($div[starts-with(., 'Attr A:')], ':')" />
</AttrA>
</message>
</xsl:template>
</xsl:stylesheet>
But AttrA comes back empty. How can I get the XML I want? Thank you very much.
(I had to add this sentence to reach the required character count.)
<xsl:copy-of select="parse-xml(body_html)" />
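This raises an error because the string value of the body_html element is not a well-formed XML document.
One of the problems is that the <meta> element has only a start tag and no end tag: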
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I hope this explains the problem.
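To see the failure explicitly instead of having the whole transformation abort, the parse-xml() call can be wrapped in xsl:try. The following is only a sketch: the <parse-error> element name is made up for illustration, and it assumes the processor reports the failure with the standard error code err:FODC0006.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="err">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="body_html">
    <xsl:copy>
      <xsl:try>
        <!-- succeeds only if the string value is well-formed XML -->
        <xsl:copy-of select="parse-xml(.)/*"/>
        <!-- FODC0006: the string passed to parse-xml() is not well-formed -->
        <xsl:catch errors="err:FODC0006">
          <!-- hypothetical marker element carrying the processor's message -->
          <parse-error>
            <xsl:value-of select="$err:description"/>
          </parse-error>
        </xsl:catch>
      </xsl:try>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With the input from the question, the catch branch should be taken, and the message usually points at the unclosed meta element.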
One possible solution:
You may want to use EXPath, which provides a module for parsing HTML.
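As an illustration of the HTML-parser route, here is a rough sketch that uses the Saxon extension function saxon:parse-html rather than an EXPath module. It rests on several assumptions to check against your setup: as far as I know the function is only available in Saxon-PE/EE and needs the TagSoup parser on the classpath, and I am not sure which namespace the resulting elements end up in, which is why the paths use the *: wildcard.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="body_html">
    <!-- an HTML parser tolerates the unclosed meta element -->
    <xsl:variable name="div"
        select="saxon:parse-html(string(.))//*:div/*:div"/>
    <xsl:copy>
      <AttrA>
        <xsl:value-of
            select="substring-after($div[starts-with(., 'Attr A:')], 'Attr A:')"/>
      </AttrA>
      <!-- the other attributes would follow the same pattern -->
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The advantage of this route is that the input does not have to be repaired first.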
If the problem with the meta element is fixed as shown below (for readability, the text is shown in a CDATA section):
<message_root>
<message>
<to>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</to>
<from>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</from>
<return-path>abc</return-path>
<date>abc</date>
<subject>abc</subject>
<mime-version>1.0</mime-version>
<message-id>&lt;abc&gt;</message-id>
<body_html>
<![CDATA[
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<style type="text/css" id="owaParaStyle"></style>
</head>
<body fpstyle="1" ocsi="0">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello alfjskfslfkjsjsf
<div>Attr A: Hello my name is</div>
<div>Attr B: ABCXYZ</div>
<div>Attr C: 5</div>
<div>Attr D: Mr.ABC</div>
<div>Thank you so much</div></div>
</body>
</html>]]>
</body_html>
<body_text />
</message>
</message_root>
and the transformation is updated accordingly:
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" />
<xsl:template match="message_root">
<message_root>
<xsl:apply-templates select="message" />
</message_root>
</xsl:template>
<xsl:template match="message">
<message>
<xsl:apply-templates select="body_text" />
<datasource>Inbox</datasource>
<source>Test</source>
<xsl:copy-of select="subject" />
<xsl:copy-of select="date" />
<xsl:copy-of select="from" />
<xsl:copy-of select="to" />
<xsl:variable name="hDoc" select="parse-xml(body_html)"/>
<xsl:copy-of select="$hDoc/*" />
<messageid>
<xsl:value-of select="substring-before(translate(translate(message-id,'&lt;',''),'&gt;',''),'@')" />
</messageid>
<xsl:variable name="div" select="$hDoc/html/body/div/div" />
<AttrA>
<xsl:value-of select="substring-after($div[starts-with(., 'Attr A:')], ':')" />
</AttrA>
</message>
</xsl:template>
</xsl:stylesheet>
then what appears to be the wanted result is produced:
<?xml version="1.0" encoding="UTF-8"?>
<message_root>
<message>
<datasource>Inbox</datasource>
<source>Test</source>
<subject>abc</subject>
<date>abc</date>
<from>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</from>
<to>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</to>
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<style type="text/css" id="owaParaStyle"/>
</head>
<body fpstyle="1" ocsi="0">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello alfjskfslfkjsjsf
<div>Attr A: Hello my name is</div>
<div>Attr B: ABCXYZ</div>
<div>Attr C: 5</div>
<div>Attr D: Mr.ABC</div>
<div>Thank you so much</div>
</div>
</body>
</html>
<messageid/>
<AttrA> Hello my name is</AttrA>
</message>
</message_root>
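You are asking how to parse encoded XML, but your input contains encoded HTML that is not well-formed XML. It cannot be parsed with the XSLT 3.0 parse-xml() function.
Without an HTML parser (for example https://www.saxonica.com/documentation11/index.html#!functions/saxon/parse-html), you will need to take a more primitive approach: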
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="message">
<xsl:copy>
<xsl:copy-of select="* except (body_html | message-id)"/>
<messageid>
<xsl:value-of select="substring-before(substring-after(message-id, '&lt;'), '&gt;')" />
</messageid>
<body_html>
<AttrA>
<xsl:value-of select="substring-before(substring-after(body_html, '&lt;div&gt;Attr A:'), '&lt;/div&gt;')" />
</AttrA>
<!-- ... -->
</body_html>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
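The same string-level idea can be generalized so that each attribute does not need its own expression. The sketch below uses xsl:analyze-string and assumes that every value sits in its own `<div>Attr X: value</div>` with no nested markup, as in the sample; the regular expression is my own guess at that pattern, not something taken from your real data.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="body_html">
    <xsl:copy>
      <!-- group 1: the letter after "Attr", group 2: the value -->
      <xsl:analyze-string select="string(.)"
          regex="&lt;div>Attr\s+(\w+):\s*(.*?)&lt;/div>">
        <xsl:matching-substring>
          <!-- builds AttrA, AttrB, ... from the captured letter -->
          <xsl:element name="Attr{regex-group(1)}">
            <xsl:value-of select="regex-group(2)"/>
          </xsl:element>
        </xsl:matching-substring>
        <!-- non-matching text is dropped: no xsl:non-matching-substring -->
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

This treats the HTML as plain text, so it is fragile: attributes on the inner div elements or line breaks inside a value would stop the pattern from matching.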
If you are able to provide input that contains a string representing well-formed XML, for example:
<message_root>
<message>
<to>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</to>
<from>
<displayName>abc</displayName>
<email>abc</email>
<name>abc</name>
</from>
<return-path>abc</return-path>
<date>abc</date>
<subject>abc</subject>
<mime-version>1.0</mime-version>
<message-id>&lt;abc&gt;</message-id>
<body_html>&lt;html dir="ltr"&gt;
&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8"/&gt;
&lt;style type="text/css" id="owaParaStyle"&gt;&lt;/style&gt;
&lt;/head&gt;
&lt;body fpstyle="1" ocsi="0"&gt;
&lt;div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;"&gt;Hello alfjskfslfkjsjsf
&lt;div&gt;Attr A: Hello my name is&lt;/div&gt;
&lt;div&gt;Attr B: ABCXYZ&lt;/div&gt;
&lt;div&gt;Attr C: 5&lt;/div&gt;
&lt;div&gt;Attr D: Mr.ABC&lt;/div&gt;
&lt;div&gt;Thank you so much&lt;/div&gt;
&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</body_html>
<body_text />
</message>
</message_root>
then you can do:
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="message">
<xsl:copy>
<xsl:copy-of select="* except (body_html | message-id)"/>
<messageid>
<xsl:value-of select="substring-before(substring-after(message-id, '&lt;'), '&gt;')" />
</messageid>
<body_html>
<xsl:variable name="div" select="parse-xml(body_html)/html/body/div/div" />
<AttrA>
<xsl:copy-of select="substring-after($div[starts-with(., 'Attr A:')], 'Attr A:')" />
</AttrA>
<!-- ... -->
</body_html>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
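To fill in the remaining attributes without repeating the AttrA block, one option is to loop over the parsed div elements. This is a sketch only: it assumes the body_html string is well-formed (the meta element is closed) and that each label has the form "Attr X:", so that removing the space yields a valid element name.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy everything that has no more specific template -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="body_html">
    <xsl:copy>
      <!-- only the inner divs whose text starts with an "Attr X:" label -->
      <xsl:for-each
          select="parse-xml(.)/html/body/div/div[matches(., '^Attr \w+:')]">
        <!-- "Attr A" becomes the element name AttrA -->
        <xsl:element name="{translate(substring-before(., ':'), ' ', '')}">
          <xsl:value-of select="normalize-space(substring-after(., ':'))"/>
        </xsl:element>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The message-id cleanup from the stylesheet above is left out here to keep the sketch short; the shallow-copy mode simply copies the other elements unchanged.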