Split a string into subject-predicate-object triples (3-field tuples)



Example RDF string:

<Tom_Wilkinson_(actor)> <actedIn> "In_the_Bedroom" , "The_Patriot_(2000_film)" , "Black_Knight_(film)" , "The_Last_Kiss" , "Cassandras_Dream" ; <bornOnDate> "1948-12-12" ; <isCalled> "Tom Wilkinson (Schauspieler)" , "טום וילקינסון" , "トム・ウィルキンソン" , "Tom Wilkinson" , "ום וילקינסון" , "ム・ウィルキンソン" ;.

Triples for the given string:

<Tom_Wilkinson_(actor)> <actedIn> "In_the_Bedroom"     
<Tom_Wilkinson_(actor)> <actedIn> "The_Patriot_(2000_film)" 
<Tom_Wilkinson_(actor)> <actedIn> "Black_Knight_(film)" 
<Tom_Wilkinson_(actor)> <actedIn> "The_Last_Kiss" 
<Tom_Wilkinson_(actor)> <actedIn> "Cassandras_Dream"
<Tom_Wilkinson_(actor)> <bornOnDate> "1948-12-12"
<Tom_Wilkinson_(actor)> <isCalled> "Tom Wilkinson (Schauspieler)"

Note: an object may contain spaces. For example, "Tom Wilkinson (Schauspieler)" is a single object with spaces inside it.

The input you've given is actually the Turtle (or N3) serialization of some RDF. It would usually be formatted like this, with a @base specified:

@base <http://stackoverflow.com/q/23192184/1281433> .
<Tom_Wilkinson_(actor)> <actedIn> "In_the_Bedroom" , "The_Patriot_(2000_film)" ,
                                  "Black_Knight_(film)" , "The_Last_Kiss" ,
                                  "Cassandras_Dream";
                        <bornOnDate> "1948-12-12";
                        <isCalled> "Tom Wilkinson (Schauspieler)" ,
                                   "טום וילקינסון" , "トム・ウィルキンソン" ,
                                   "Tom Wilkinson" , "ום וילקינסון" ,
                                   "ム・ウィルキンソン" .

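If you add an appropriate @base declaration, you can read the input and write the output with any library that can read Turtle and serialize N-Triples. For example, with Jena's rdfcat you can convert to many different formats, including N-Triples: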

$ rdfcat -out N-TRIPLES input.ttl
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/actedIn> "Black_Knight_(film)" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "ム・ウィルキンソン" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "トム・ウィルキンソン" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "Tom Wilkinson (Schauspieler)" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "ום וילקינסון" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "טום וילקינסון" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/actedIn> "The_Last_Kiss" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/bornOnDate> "1948-12-12" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/actedIn> "The_Patriot_(2000_film)" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/actedIn> "In_the_Bedroom" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/isCalled> "Tom Wilkinson" .
<http://stackoverflow.com/q/23192184/Tom_Wilkinson_(actor)> <http://stackoverflow.com/q/23192184/actedIn> "Cassandras_Dream" .

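Since you tagged this with Python, you may find RDFLib more useful than Jena, but the real question here should be how to do the conversion, not a request for a library (library requests are off-topic for Stack Overflow).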

Try using RDFLib. It looks like they have examples for parsing this kind of input.

EDIT: The format is actually N3. See the documentation for parse().
