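I have a C# application that uses the Open XML SDK to read text from a Word (.docx) file.
Typically, the document consists of paragraphs (p) that contain Run elements (r). I can iterate over the Run nodes using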
foreach ( var run in para.Descendants<Run>() )
{
...
}
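In one particular document, the text "START" is split into three parts: "ST", "AR" and "T". Each part is defined by a Run node, but in two of the three cases the Run node is wrapped in a "smartTag" node.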
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
<w:r w:rsidRPr="00BF444F">
<w:rPr>
<w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
<w:b/>
<w:bCs/>
<w:sz w:val="40"/>
<w:szCs w:val="40"/>
</w:rPr>
<w:t>ST</w:t>
</w:r>
</w:smartTag>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
<w:r w:rsidRPr="00BF444F">
<w:rPr>
<w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
<w:b/>
<w:bCs/>
<w:sz w:val="40"/>
<w:szCs w:val="40"/>
</w:rPr>
<w:t>AR</w:t>
</w:r>
</w:smartTag>
<w:r w:rsidRPr="00BF444F">
<w:rPr>
<w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
<w:b/>
<w:bCs/>
<w:sz w:val="40"/>
<w:szCs w:val="40"/>
</w:rPr>
<w:t xml:space="preserve">T</w:t>
</w:r>
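As far as I can tell, the Open XML SDK does not support the smartTag node, so it only produces an OpenXmlUnknownElement node for it.
What makes this hard is that it also produces OpenXmlUnknownElement nodes for all descendants of the smartTag. That means I cannot simply take the first child node and cast it to Run.
Getting the text (via the InnerText property) is easy, but I also need the formatting information.
Is there a reasonably simple way to handle this?
At the moment, my best idea is to write a preprocessor that removes the smart tag nodes.
Edit
Following up on Cindy Meister's comment.
I am using version 2.7.2 of the Open XML SDK. As Cindy pointed out, Open XML SDK 2.0 has a class SmartTagRun. I was not aware of that class.
I found the following information on the page for the Open XML SDK 2.5 for Office:
Smart tags
Because smart tags were deprecated in Office 2010, the Open XML SDK 2.5 does not support smart tag related Open XML elements. The Open XML SDK 2.5 can still process smart tag elements as unknown elements, but the Open XML SDK 2.5 Productivity Tool for Office validates those elements in Office document files as invalid tags.
So it sounds like a possible solution would be to use Open XML SDK 2.0.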
The solution is to use Linq to XML (or the System.Xml classes, if you prefer) to remove the w:smartTag elements, as shown in the following code:
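using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Xunit;
public class SmartTagTests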
{
private const string Xml =
@"<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
<w:body>
<w:p>
<w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
<w:r w:rsidRPr=""00BF444F"">
<w:rPr>
<w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
<w:b/>
<w:bCs/>
<w:sz w:val=""40""/>
<w:szCs w:val=""40""/>
</w:rPr>
<w:t>ST</w:t>
</w:r>
</w:smartTag>
<w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
<w:r w:rsidRPr=""00BF444F"">
<w:rPr>
<w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
<w:b/>
<w:bCs/>
<w:sz w:val=""40""/>
<w:szCs w:val=""40""/>
</w:rPr>
<w:t>AR</w:t>
</w:r>
</w:smartTag>
<w:r w:rsidRPr=""00BF444F"">
<w:rPr>
<w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
<w:b/>
<w:bCs/>
<w:sz w:val=""40""/>
<w:szCs w:val=""40""/>
</w:rPr>
<w:t xml:space=""preserve"">T</w:t>
</w:r>
</w:p>
</w:body>
</w:document>";
[Fact]
public void CanStripSmartTags()
{
// Say you have a WordprocessingDocument stored on a stream (e.g., read
// from a file).
using Stream stream = CreateTestWordprocessingDocument();
// Open the WordprocessingDocument and inspect it using the strongly-
// typed classes. This shows that OpenXmlUnknownElement instances are
// found and only a single Run instance is recognized.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Document document = part.Document;
Assert.Single(document.Descendants<Run>());
Assert.NotEmpty(document.Descendants<OpenXmlUnknownElement>());
}
// Now, open that WordprocessingDocument to make edits, using Linq to XML.
// Do NOT use the strongly typed classes in this context.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
{
// Get the w:document as an XElement and demonstrate that this
// w:document contains w:smartTag elements.
MainDocumentPart part = wordDocument.MainDocumentPart;
string xml = ReadString(part);
XElement document = XElement.Parse(xml);
Assert.NotEmpty(document.Descendants().Where(d => d.Name.LocalName == "smartTag"));
// Transform the w:document, stripping all w:smartTag elements and
// demonstrate that the transformed w:document no longer contains
// w:smartTag elements.
var transformedDocument = (XElement) StripSmartTags(document);
Assert.Empty(transformedDocument.Descendants().Where(d => d.Name.LocalName == "smartTag"));
// Write the transformed document back to the part.
WriteString(part, transformedDocument.ToString(SaveOptions.DisableFormatting));
}
// Open the WordprocessingDocument again and inspect it using the
// strongly-typed classes. This demonstrates that all Run instances
// are now recognized.
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
{
MainDocumentPart part = wordDocument.MainDocumentPart;
Document document = part.Document;
Assert.Equal(3, document.Descendants<Run>().Count());
Assert.Empty(document.Descendants<OpenXmlUnknownElement>());
}
}
/// <summary>
/// Recursive, pure functional transform that removes all w:smartTag elements.
/// </summary>
/// <param name="node">The <see cref="XNode" /> to be transformed.</param>
/// <returns>The transformed <see cref="XNode" />.</returns>
private static object StripSmartTags(XNode node)
{
// We only consider elements (not text nodes, for example).
if (!(node is XElement element))
{
return node;
}
// Strip w:smartTag elements by returning only their transformed children,
// so that nested w:smartTag elements are stripped as well.
if (element.Name.LocalName == "smartTag")
{
return element.Elements().Select(StripSmartTags);
}
// Perform the identity transform.
return new XElement(element.Name, element.Attributes(),
element.Nodes().Select(StripSmartTags));
}
private static Stream CreateTestWordprocessingDocument()
{
var stream = new MemoryStream();
using var wordDocument = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document);
MainDocumentPart part = wordDocument.AddMainDocumentPart();
WriteString(part, Xml);
return stream;
}
#region Generic Open XML Utilities
private static string ReadString(OpenXmlPart part)
{
using Stream stream = part.GetStream(FileMode.Open, FileAccess.Read);
using var streamReader = new StreamReader(stream);
return streamReader.ReadToEnd();
}
private static void WriteString(OpenXmlPart part, string text)
{
using Stream stream = part.GetStream(FileMode.Create, FileAccess.Write);
using var streamWriter = new StreamWriter(stream);
streamWriter.Write(text);
}
#endregion
}
You can also use PowerTools for Open XML, which provides a markup simplifier that directly supports removing w:smartTag elements.
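For illustration, here is a minimal sketch of what the PowerTools route might look like. It assumes the OpenXmlPowerTools NuGet package is referenced and relies on its MarkupSimplifier.SimplifyMarkup method and the SimplifyMarkupSettings.RemoveSmartTags flag as I understand them. The SmartTagCleaner class and the path parameter are hypothetical names introduced only for this example.

```csharp
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class SmartTagCleaner
{
    // Hypothetical helper: opens the .docx at 'path' for editing and asks the
    // PowerTools markup simplifier to remove the w:smartTag wrappers in place.
    public static void RemoveSmartTags(string path)
    {
        using WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(path, true);

        // Only smart tag removal is enabled; all other simplifications stay off.
        var settings = new SimplifyMarkupSettings { RemoveSmartTags = true };

        MarkupSimplifier.SimplifyMarkup(wordDocument, settings);
    }
}
```

Because this modifies the file, you would probably want to run it on a copy of the document and then read the copy with the strongly typed classes as usual.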
You can check for a smartTag using e.LocalName == "smartTag" and for a Run using e.LocalName == "r".
if (child.LocalName == "smartTag") {
void f(OpenXmlElement e)
{
if (e == null) return;
if (e.LocalName == "r") {
var r = new Run();
r.InnerXml = e.InnerXml;
// process the run
ProcessRun(r);
return;
}
if (!e.HasChildren) return;
foreach (var ele in e.ChildElements)
{
f(ele);
}
}
f(child);
}
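The same idea can be packaged as a reusable helper that returns typed Run objects, so the formatting can be read through RunProperties. This is only a sketch under a few assumptions: SmartTagRunReader, GetRuns and PrintFormatting are hypothetical names, and an unknown w:r element is rebuilt from its OuterXml (which also keeps the attributes of the original run) instead of copying InnerXml.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SmartTagRunReader
{
    private const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Walks the children of 'root' in document order and yields every run,
    // rebuilding a typed Run for each w:r the SDK exposed as an unknown element.
    public static IEnumerable<Run> GetRuns(OpenXmlElement root)
    {
        foreach (OpenXmlElement child in root.ChildElements)
        {
            if (child is Run typedRun)
            {
                yield return typedRun;
            }
            else if (child is OpenXmlUnknownElement
                     && child.LocalName == "r"
                     && child.NamespaceUri == WordNamespace)
            {
                // The rebuilt Run is a detached copy, suitable for reading only.
                yield return new Run(child.OuterXml);
            }
            else
            {
                // Descend into containers such as w:smartTag.
                foreach (Run nested in GetRuns(child))
                {
                    yield return nested;
                }
            }
        }
    }

    // Usage example: print the text and some formatting of each run.
    public static void PrintFormatting(Paragraph para)
    {
        foreach (Run run in GetRuns(para))
        {
            bool hasBoldElement = run.RunProperties?.Bold != null;
            string size = run.RunProperties?.FontSize?.Val?.Value;
            Console.WriteLine($"{run.InnerText}: w:b present={hasBoldElement}, w:sz={size}");
        }
    }
}
```

Since the rebuilt runs are copies, changes made to them are not written back to the document; for editing, removing the w:smartTag wrappers first is likely the better fit.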