I have some RDF/XML data that I want to parse so that I can access its nodes. It looks like this:
<!-- http://purl.obolibrary.org/obo/VO_0000185 -->
<owl:Class rdf:about="&obo;VO_0000185">
<rdfs:label>Influenza virus gene</rdfs:label>
<rdfs:subClassOf rdf:resource="&obo;VO_0000156"/>
<obo:IAO_0000117>YH</obo:IAO_0000117>
</owl:Class>
<!-- http://purl.obolibrary.org/obo/VO_0000186 -->
<owl:Class rdf:about="&obo;VO_0000186">
<rdfs:label>RNA vaccine</rdfs:label>
<owl:equivalentClass>
<owl:Class>
<owl:intersectionOf rdf:parseType="Collection">
<rdf:Description rdf:about="&obo;VO_0000001"/>
<owl:Restriction>
<owl:onProperty rdf:resource="&obo;BFO_0000161"/>
<owl:someValuesFrom rdf:resource="&obo;VO_0000728"/>
</owl:Restriction>
</owl:intersectionOf>
</owl:Class>
</owl:equivalentClass>
<rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
<obo:IAO_0000116>Using RNA may eliminate the problem of having to tailor a vaccine for each individual patient with their specific immunity. The advantage of RNA is that it can be used for all immunity types and can be taken from a single cell. DNA vaccines need to produce RNA which then prompts the manufacture of proteins. However, RNA vaccine eliminates the step from DNA to RNA.</obo:IAO_0000116>
<obo:IAO_0000115>A vaccine that uses RNA(s) derived from a pathogen organism.</obo:IAO_0000115>
<obo:IAO_0000117>YH</obo:IAO_0000117>
</owl:Class>
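The complete RDF/XML file can be found here.
What I want to do is the following:
- Find the chunks that contain the entry
<rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
- Access the literal term defined by
<rdfs:label>...</rdfs:label>
So, in the example above, the code would go through the second chunk and output: "RNA vaccine".
I am currently stuck with the following code, which fails to access the nodes. What is the right way to do it? Solutions that use something other than XML::LibXML are welcome.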
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;
use File::Basename;
use XML::LibXML 1.70;
my $filename = "VO.owl";
# Obtained from http://svn.code.sf.net/p/vaccineontology/code/trunk/src/ontology/VO.owl
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file( $filename );
foreach my $chunk ($doc->findnodes('/owl:Class')) {
    my ($label) = $chunk->findnodes('./rdfs:label');
    my ($subclass) = $chunk->findnodes('./rdfs:subClassOf');
    print $label->to_literal;
    print $subclass->to_literal;
}
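Parsing RDF as though it were XML is folly. Exactly the same data can appear in many different ways. For example, all of the following RDF files carry the same data, and any conformant RDF implementation must handle them identically...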
<!-- example 1 -->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="#me">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
<foaf:name>Toby Inkster</foaf:name>
</rdf:Description>
</rdf:RDF>
<!-- example 2 -->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="#me">
<foaf:name>Toby Inkster</foaf:name>
</foaf:Person>
</rdf:RDF>
<!-- example 3 -->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="#me" foaf:name="Toby Inkster" />
</rdf:RDF>
<!-- example 4 -->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:about="#me"
rdf:type="http://xmlns.com/foaf/0.1/Person"
foaf:name="Toby Inkster" />
</rdf:RDF>
<!-- example 5 -->
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<rdf:Description rdf:ID="me">
<rdf:type>
<rdf:Description rdf:about="http://xmlns.com/foaf/0.1/Person" />
</rdf:type>
<foaf:name>Toby Inkster</foaf:name>
</rdf:Description>
</rdf:RDF>
<!-- example 6 -->
<foaf:Person
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
rdf:about="#me"
foaf:name="Toby Inkster" />
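I could easily list half a dozen more variations, but I'll stop there. This RDF file contains only two statements (I'm a Person; my name is "Toby Inkster"), whereas the OP's data contains more than 50,000 statements.
And that is just the XML serialization of RDF; there are other serializations too.
If you try to handle all of that with XPath, you are likely to end up as a madman locked in a tower somewhere, muttering in your sleep about the triples; the triples...
Fortunately, Greg Williams has taken that mental health bullet for you. RDF::Trine and RDF::Query are not only the best RDF frameworks for Perl; they are among the best in any programming language.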
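To make the equivalence claim concrete, here is a minimal sketch that parses two of the variants above (the rdf:Description form and the typed-node form) with RDF::Trine and compares the resulting triples as sorted N-Triples lines. The base URI http://example.org/toby and the helper name triples_of are assumptions made up for this illustration; a base URI is needed only so that "#me" resolves to the same absolute URI in both files.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use RDF::Trine;

# Assumed base URI (hypothetical), used to resolve the relative "#me".
my $base = 'http://example.org/toby';

# Variant 1: rdf:Description with an explicit rdf:type property element.
my $description_form = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
    <foaf:name>Toby Inkster</foaf:name>
  </rdf:Description>
</rdf:RDF>
XML

# Variant 2: typed node element, where the element name carries the type.
my $typed_node_form = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#me">
    <foaf:name>Toby Inkster</foaf:name>
  </foaf:Person>
</rdf:RDF>
XML

# Hypothetical helper: parse RDF/XML into an in-memory model and return
# its triples as sorted N-Triples lines, so two graphs can be compared.
sub triples_of {
    my ($xml) = @_;
    my $model = RDF::Trine::Model->temporary_model;
    RDF::Trine::Parser->new('rdfxml')->parse_into_model($base, $xml, $model);
    my $iter = $model->get_statements(undef, undef, undef);
    my @lines;
    while (my $st = $iter->next) {
        push @lines, join(' ', map { $_->as_ntriples } $st->nodes) . ' .';
    }
    return join "\n", sort @lines;
}

my $first  = triples_of($description_form);
my $second = triples_of($typed_node_form);

print "$first\n";
print $first eq $second ? "Same triples\n" : "Different triples\n";
```

The XML trees differ, so an XPath written for one would miss the other, yet an RDF parser reduces both to the same two statements.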
Here is how you could use RDF::Trine and RDF::Query to accomplish the OP's task:
#!/usr/bin/env perl
use v5.12;
use RDF::Trine;
use RDF::Query;
my $model = 'RDF::Trine::Model'->new(
    'RDF::Trine::Store::DBI'->new(
        'vo',
        'dbi:SQLite:dbname=/tmp/vo.sqlite',
        '', # no username
        '', # no password
    ),
);
'RDF::Trine::Parser::RDFXML'->new->parse_url_into_model(
    'http://svn.code.sf.net/p/vaccineontology/code/trunk/src/ontology/VO.owl',
    $model,
) unless $model->size > 0;
my $query = RDF::Query->new(<<'SPARQL');
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super_label ?sub_label
WHERE {
?sub rdfs:subClassOf ?super .
?sub rdfs:label ?sub_label .
?super rdfs:label ?super_label .
}
LIMIT 5
SPARQL
print $query->execute($model)->as_string;
Sample output:
+----------------------------+----------------------------------+
| super_label | sub_label |
+----------------------------+----------------------------------+
| "Aves vaccine" | "Ducks vaccine" |
| "route of administration" | "intravaginal route" |
| "Shigella gene" | "aroA from Shigella" |
| "Papillomavirus vaccine" | "Bovine papillomavirus vaccine" |
| "virus protein" | "Feline leukemia virus protein" |
+----------------------------+----------------------------------+
UPDATE: Here is a SPARQL query that can be plugged into the script above to retrieve the data you want:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT ?subclass ?label
WHERE {
?subclass
rdfs:subClassOf obo:VO_0000001 ;
rdfs:label ?label .
}
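If you want the bare label strings rather than the formatted table, you can walk the result iterator yourself. The following is a minimal sketch rather than a drop-in replacement: to stay self-contained it loads a two-class stand-in for VO.owl (trimmed from the question's excerpt, with the &obo; entity written out in full) into a temporary in-memory model instead of the SQLite store used above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use RDF::Trine;
use RDF::Query;

# Stand-in data (assumption): two classes trimmed from the question's
# excerpt, with the &obo; entity expanded to its full URI.
my $rdfxml = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000185">
    <rdfs:label>Influenza virus gene</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000156"/>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000186">
    <rdfs:label>RNA vaccine</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000001"/>
  </owl:Class>
</rdf:RDF>
XML

# In-memory model instead of the DBI-backed store.
my $model = RDF::Trine::Model->temporary_model;
RDF::Trine::Parser->new('rdfxml')
    ->parse_into_model('http://purl.obolibrary.org/obo/', $rdfxml, $model);

my $query = RDF::Query->new(<<'SPARQL');
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT ?subclass ?label
WHERE {
  ?subclass
    rdfs:subClassOf obo:VO_0000001 ;
    rdfs:label ?label .
}
SPARQL

# Each row is a hash reference mapping variable names to RDF terms.
my @labels;
my $results = $query->execute($model);
while (my $row = $results->next) {
    push @labels, $row->{label}->literal_value;
}

print "$_\n" for @labels;
```

Against the full ontology the same loop applies unchanged; only the model construction differs.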
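/owl:Class selects a root element named owl:Class, but owl:Class is not the root element of this XML document. You have to include the root element in the XPath: /rdf:RDF/owl:Class. Alternatively, if you want every occurrence regardless of its depth in the XML tree, use the double-slash notation: //owl:Class.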
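As a rough sketch of that fix, the snippet below registers the namespace prefixes explicitly on an XML::LibXML::XPathContext and uses a predicate to keep only the classes whose rdfs:subClassOf points at VO_0000001. The inline document is a trimmed stand-in for VO.owl with the &obo; entity written out in full (an assumption made so the example needs no DOCTYPE), and it only works while the file keeps this particular XML shape.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in data (assumption): two classes trimmed from the question's
# excerpt, with the &obo; entity expanded to its full URI.
my $xml = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000185">
    <rdfs:label>Influenza virus gene</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000156"/>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000186">
    <rdfs:label>RNA vaccine</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000001"/>
  </owl:Class>
</rdf:RDF>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind the prefixes used in the XPath to their namespace URIs.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(rdf  => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$xpc->registerNs(rdfs => 'http://www.w3.org/2000/01/rdf-schema#');
$xpc->registerNs(owl  => 'http://www.w3.org/2002/07/owl#');

my $parent = 'http://purl.obolibrary.org/obo/VO_0000001';

# Start from the real root element, then filter on the subClassOf target.
my @labels = map { $_->textContent } $xpc->findnodes(
    qq{/rdf:RDF/owl:Class[rdfs:subClassOf/\@rdf:resource="$parent"]/rdfs:label}
);

print "$_\n" for @labels;
```

With the real file, where the attribute is written as &obo;VO_0000001, the comparison is expected to work only if the parser expands the entity declared in the file's DOCTYPE, which is worth checking before relying on it.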