I am trying to access DBpedia (de) data on my local machine. After downloading and unpacking some ttl files, I tried to test a very simple SPARQL query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?s
WHERE
{
?s rdf:type skos:Concept .
?s rdfs:label ?label .
}
LIMIT 100
using this ARQ command (on Windows):
arq --data dewiki-20140813-article-categories.ttl --query dbpedia_cat.rq
I expected this to work without any trouble, but I got a bunch of errors like these:
19:29:02 WARN riot :: [line: 2860693, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
19:29:02 WARN riot :: [line: 2860693, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
19:29:02 WARN riot :: [line: 2860694, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
19:29:02 WARN riot :: [line: 2860694, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
After these errors, ARQ printed the following:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:170)
at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:86)
at org.apache.jena.atlas.iterator.PeekIterator.fill(PeekIterator.java:50)
at org.apache.jena.atlas.iterator.PeekIterator.next(PeekIterator.java:92)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:99)
at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:287)
at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:269)
at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:250)
at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:191)
at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:44)
at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:90)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:534)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:501)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:454)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:432)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:422)
at arq.cmdline.ModDatasetGeneral.addGraphs(ModDatasetGeneral.java:101)
at arq.cmdline.ModDatasetGeneral.createDataset(ModDatasetGeneral.java:90)
at arq.cmdline.ModDatasetGeneralAssembler.createDataset(ModDatasetGeneralAssembler.java:35)
at arq.cmdline.ModDataset.getDataset(ModDataset.java:34)
at arq.query.getDataset(query.java:176)
at arq.query.queryExec(query.java:198)
at arq.query.exec(query.java:159)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at arq.arq.main(arq.java:28)
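After testing with two unpacking utilities (Ark on Linux and WinRAR on Windows), I am fairly sure that the decompression is not the problem.
I also looked at the ttl file in Notepad++, and all characters look correct to me, even the problematic ones such as É, Ö, Ü and so on.
So I have no idea how to deal with these errors, and I would appreciate any help!
(Sorry for asking a question that is not really about programming. I don't know whether the problem lies with Jena or with DBpedia, and therefore which mailing list would be the right one. In any case it is a beginner's question, so I hope someone here can help.)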
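The warnings are just that: warnings, not errors. The W3C standards would prefer that the data not be encoded like this. The file is valid UTF-8, but the IRI contains a compatibility character and is therefore not in Unicode Normalization Form KC, which is what the NOT_NFKC and COMPATIBILITY_CHARACTER codes report. Parsing continues after these messages.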
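To see what the NOT_NFKC warning is checking, here is a minimal sketch using Python's standard `unicodedata` module. It assumes that the character the Windows console renders as "à" is really U+2026 HORIZONTAL ELLIPSIS; that is a guess based on the resource name and is not confirmed by the log. The helper name `nfkc_report` is made up for this illustration.

```python
import unicodedata


def nfkc_report(iri):
    """Return (is_nfkc, offenders) for the given IRI string.

    is_nfkc   -- True if NFKC normalization leaves the string unchanged
    offenders -- list of (character, Unicode name) pairs that NFKC would rewrite
    """
    normalized = unicodedata.normalize("NFKC", iri)
    offenders = [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in iri
        if unicodedata.normalize("NFKC", ch) != ch
    ]
    return normalized == iri, offenders


# Assumption: the garbled first character of the resource name is U+2026.
iri = "http://de.dbpedia.org/resource/\u2026Baby_One_More_Time"

is_nfkc, offenders = nfkc_report(iri)
print(is_nfkc)  # False
for ch, name in offenders:
    print("U+%04X %s" % (ord(ch), name))  # U+2026 HORIZONTAL ELLIPSIS
```

The ellipsis is a compatibility character (NFKC rewrites it as three full stops), which would explain why both warning codes appear for the same IRI. Umlauts such as Ö and Ü in precomposed form are already NFKC and would not be flagged by this check.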
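This option
--data dewiki-20140813-article-categories.ttl
loads all the data into memory, which is why the JVM runs out of space. Either load the file into a database such as TDB, or, if the file looks like it could fit in memory on your machine, increase the Java heap size.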
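The TDB route can be sketched as two steps: bulk-load the file into an on-disk store once, then run the query against that store instead of passing `--data`. The sketch below only assembles the two command lines and runs them if the Jena TDB tools `tdbloader` and `tdbquery` are on the PATH; otherwise it prints them. The directory name `dbpedia-de-tdb` and both helper functions are made up for this example, and the option spellings should be checked against the documentation of your Jena version.

```python
import shutil
import subprocess

DATA_FILE = "dewiki-20140813-article-categories.ttl"  # file name from the question
QUERY_FILE = "dbpedia_cat.rq"                          # query file from the question
DB_DIR = "dbpedia-de-tdb"                              # hypothetical TDB directory


def build_commands(data_file, query_file, db_dir):
    """Return the (load, query) command lines as argument lists."""
    load = ["tdbloader", "--loc=" + db_dir, data_file]
    query = ["tdbquery", "--loc=" + db_dir, "--query=" + query_file]
    return load, query


def run_if_available(commands):
    """Run the commands in order if every tool is on PATH, else print them.

    Returns True if the commands were actually executed.
    """
    resolved = [shutil.which(cmd[0]) for cmd in commands]
    if None in resolved:
        for cmd in commands:
            print(" ".join(cmd))
        return False
    for exe, cmd in zip(resolved, commands):
        # check=True stops at the first failing step, e.g. a failed load.
        subprocess.run([exe] + cmd[1:], check=True)
    return True


if __name__ == "__main__":
    run_if_available(build_commands(DATA_FILE, QUERY_FILE, DB_DIR))
```

The load only has to be done once; later queries reuse the same directory. For the other route, raising the heap, the Jena launcher scripts I am aware of pick up extra JVM options such as `-Xmx4G` from a `JVM_ARGS` environment variable, but treat that as an assumption and check the script shipped with your installation.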