Using Perl XML::LibXML Element->getAttribute() without expanding Unicode entities in the value

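I'm currently trying to write a Perl script that uses LibXML to process the data in an SVG font.

In an SVG font, each character is defined as a glyph element with a unicode attribute, which gives the character's Unicode code point in the form of a Unicode entity, like this: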

<glyph unicode="&#x2000;" />

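Part of what I want to do is take the value of each glyph element's unicode attribute and process it as a string. However, when I call Element->getAttribute('unicode') on a glyph node, it returns a "wide character" that is displayed as a placeholder rectangle, which leads me to believe it is expanding the Unicode entity into the actual Unicode character and returning that.

When I create the parser I set expand_entities to 0, so I'm not sure what else I can do to prevent this. I'm new to XML processing, so I'm not sure I actually understand what is happening, or whether it can be prevented at all.

Here is a code sample: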

use utf8;
use open ':std', ':encoding(UTF-8)';
use strict;
use warnings;
use XML::LibXML;
$XML::LibXML::skipXMLDeclaration = 1;
my $xmlFile = $ARGV[0];
my $parser = XML::LibXML->new();
$parser->load_ext_dtd(0);
$parser->validation(0);
$parser->no_network(1);
$parser->recover(1);
$parser->expand_entities(0);
my $xmlDom = $parser->load_xml(location => $xmlFile);
my $xmlDomSvg = XML::LibXML::XPathContext->new();
$xmlDomSvg->registerNs('svg', 'http://www.w3.org/2000/svg');
foreach my $myGlyph ($xmlDomSvg->findnodes('/svg:svg/svg:defs/svg:font/svg:glyph', $xmlDom))
{
  my $myGlyphCode = $myGlyph->getAttribute('unicode');
  print $myGlyphCode . "\n";
}

Note: if I run print $myGlyph->toString(); the Unicode entity in the output is not expanded, which is why I concluded that the expansion happens in the getAttribute method.

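This may not be the answer you are looking for, but IMHO getAttribute gives you enough information, namely a Perl string, to solve your problem another way. You are trying to write a Perl string to a filehandle that is not set up for UTF-8, which is why you get the "Wide character" warning.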
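To make the character/byte distinction concrete, here is a minimal sketch using only the core Encode module. It assumes the value in question is the single character U+2000 from the example glyph; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One decoded character, as getAttribute('unicode') would hand it back.
my $char = "\x{2000}";

# Encoding it explicitly turns the character into UTF-8 octets.
my $bytes = encode('UTF-8', $char);

printf "characters: %d, bytes: %d\n", length($char), length($bytes);
printf "octets: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
```

Printing $char to a filehandle without an encoding layer is what triggers the warning; printing $bytes, or adding an :encoding(UTF-8) layer to the handle, does not.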

A stripped-down example of how to get the U+xxxx value you are looking for:

use strict;
use warnings;
use open qw(:encoding(UTF-8) :std);
use XML::LibXML;
my $dom = XML::LibXML->load_xml(IO => *DATA)
    or die "XML\n";
my $root = $dom->documentElement();
print $root->toString(), "\n";
my $attr = $root->getAttribute('unicode');
printf("'%s' is %d (U+%04X)\n", $attr, ord($attr), ord($attr));
exit 0;
__DATA__
<glyph unicode="&#x2000;" />

Test run:

$ perl dummy.pl
<glyph unicode="&#x2000;"/>
' ' is 8192 (U+2000)
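If what you ultimately need is the `&#x...;` text itself, one option is to rebuild it from the decoded string. The following is only a sketch in core Perl: to_char_refs is a hypothetical helper name, not part of XML::LibXML, and it emits upper-case hex digits, which may differ in letter case or padding from what the source file used. It loops over every character because a glyph's unicode attribute may hold more than one character (for example, a ligature).

```perl
use strict;
use warnings;

# Hypothetical helper: turn each character of a decoded string back into
# an XML hexadecimal character reference.
sub to_char_refs {
    my ($text) = @_;
    return join '', map { sprintf '&#x%X;', ord($_) } split //, $text;
}

my $decoded = "\x{2000}";    # stands in for the getAttribute('unicode') result
print to_char_refs($decoded), "\n";    # prints &#x2000;
```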

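Update: the documentation for expand_entities is IMHO misleading. It talks about "entities", but it apparently means ENTITY definitions, i.e. new entities introduced in the document. Unfortunately, the libxml2 documentation is not clear on this either. However, this old message seems to indicate that the behaviour you describe is expected, i.e. the XML parser should always replace predefined entities: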

#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
my $parser = XML::LibXML->new({
    expand_entities => $ARGV[0] ? 1 : 0,
});
my $dom = $parser->load_xml(IO => *DATA)
    or die "XML\n";
my $root = $dom->documentElement();
print "toString():  ", $root->toString(), "\n";
print "textContent: ", $root->textContent(), "\n";
my $attr = $root->getAttribute('test');
print "attribute:   ${attr}\n";
exit 0;
__DATA__
<?xml version="1.0"?>
<!DOCTYPE foo [
<!ENTITY author "Fluffy Bunny">
]>
<tag test="&lt;&author;&gt;">&lt;&author;&gt;</tag>

Test run:

$ perl dummy.pl 0
toString():  <tag test="&lt;&author;&gt;">&lt;&author;&gt;</tag>
textContent: <Fluffy Bunny>
attribute:   <Fluffy Bunny>
$ perl dummy.pl 1
toString():  <tag test="&lt;Fluffy Bunny&gt;">&lt;Fluffy Bunny&gt;</tag>
textContent: <Fluffy Bunny>
attribute:   <Fluffy Bunny>

The serializeContent() method may do what you are after:

use feature 'say';
use XML::LibXML;

my $xml = '<doc>
  <glyph unicode="&#x2000;" />
</doc>';
my $dom = XML::LibXML->load_xml(
    string          => $xml,
    expand_entities => 0,
    no_network      => 1,
);
my($attr) = $dom->findnodes('//glyph[1]/@unicode');
say $attr->serializeContent();

Output:

&#x2000;

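I suspect the expand_entities option does not apply to numeric character entities. The documentation is unclear on this, and I have not looked at the source.

In the more common case where you do want all entities expanded and only need the actual characters those entities represent, you don't even need to call getAttribute(). Each element node object supports a tied hash interface, so you can do this: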

my $text = $glyph->{unicode};
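As a quick check of the hash-style access, the sketch below parses a one-element document (an assumption made for brevity, not the asker's real SVG file) and compares both ways of reading the attribute. It requires XML::LibXML to be installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Minimal stand-in document containing a single glyph element.
my $dom   = XML::LibXML->load_xml(string => '<glyph unicode="&#x2000;"/>');
my $glyph = $dom->documentElement();

my $via_hash   = $glyph->{unicode};                # hash-style access
my $via_method = $glyph->getAttribute('unicode');  # method call

printf "same value: %s\n", $via_hash eq $via_method ? 'yes' : 'no';
printf "U+%04X\n", ord($via_hash);
```

Both forms return the decoded character, so either one can feed the same ord()-based processing.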
