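I am trying to parse MediaWiki markup, specifically the markup used in English Wiktionary articles.
It is not a programming language, its handling of whitespace and newlines is a bit odd, and every step feels like trial and (a lot of) error.
Here is the repo: https://github.com/WorDB/wikitext-parser
The test input file is the article for "pie": pie.txt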
(https://en.wiktionary.org/wiki/pie)
Note: I am parsing the entire Wiktionary XML dump, so I would rather find a solution that parses with ANTLR than get suggestions to use some online API or the like.
wikitext.g4
grammar wikitext;
/**
Grammar
*/
page: EOL? ((wikitem | bullet_line) EOL? )+ EOF;
wikitem:
wikitem wikitem
| title
| template
| link
| text
;
title: title2 | title3 | title4 | title5;
title5: '=====' text '=====';
title4: '====' text '====';
title3: '===' text '===';
title2: '==' text '==';
template: '{{' parameter ('|' parameter)* '}}';
link: '[[' parameter ('|' parameter)* ']]';
parameter: wikitem?; // a parameter can be empty, e.g. {{a|}}
bullet: ('*'|'#'|'#:'|'#*');
bullet_line: WS? EOL WS? bullet WS? wikitem;
text: (CHAR | WS)+;
/**
Lexicon
*/
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;
Error:
> cd ./java && grun wikitext page -gui ../data/pie.txt
line 190:137 no viable alternative at input 'rom {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'eminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'minine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'inine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'nine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ne of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'e of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'f {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|ine-pro|*'
line 190:137 extraneous input '*' expecting {'|', '}}'}
line 190:146 no viable alternative at input 's)peyk-|'
line 190:146 no viable alternative at input ')peyk-|'
line 190:146 no viable alternative at input 'peyk-|'
line 190:146 no viable alternative at input 'eyk-|'
line 190:146 no viable alternative at input 'yk-|'
line 190:146 no viable alternative at input 'k-|'
line 190:146 no viable alternative at input '-|'
line 190:146 mismatched input '|' expecting {<EOF>, '=====', '====', '===', '==', '{{', '[[', EOL, CHAR, WS}
I changed some rules. Could you check whether this works for you?
grammar wikitext;
/**
Grammar
*/
page: EOL? (wikitem EOL? )+ EOF;
wikitem:
wikitem wikitem
| title
| template
| link
| text
| bullet_line
;
title: title2 | title3 | title4 | title5;
title5: '=====' text '=====';
title4: '====' text '====';
title3: '===' text '===';
title2: '==' text '==';
template: '{{' parameter ('|' parameter)* '}}';
link: '[[' parameter ('|' parameter)* ']]';
parameter: wikitem?; // a parameter can be empty, e.g. {{a|}}
bullet_line: WS? bullet=('*'|'#'|'#:'|'#*') WS? wikitem;
text: (CHAR | WS)+;
/**
Lexicon
*/
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;
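
A likely reason for the "extraneous input '*'" message in the first grammar: in a combined ANTLR 4 grammar, string literals used in parser rules (such as '*' and '|') become implicit tokens that are defined ahead of the explicit lexer rules. The lexer takes the longest match and, on a tie, the rule defined first, so a lone * is emitted as the '*' token and never as CHAR. Since text only accepts CHAR and WS, a * inside a template parameter has no alternative to match unless bullet_line is reachable from wikitem, as it is in the revised grammar. The Python sketch below only imitates that tokenization policy; it is not ANTLR itself, the name `tokenize` is made up for illustration, and the sample input is a fragment modelled on the one in the error log.

```python
import re

# Literals from the parser rules, in order of appearance. ANTLR 4 turns
# them into implicit tokens that rank ahead of the explicit lexer rules.
LITERALS = ["=====", "====", "===", "==", "{{", "|", "}}",
            "[[", "]]", "*", "#", "#:", "#*"]

RULES = [("'" + lit + "'", re.compile(re.escape(lit))) for lit in LITERALS]
RULES += [
    ("EOL", re.compile(r"[\f\r\n]+")),
    ("CHAR", re.compile(r"[^ \t\f\r\n]")),
    ("WS", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Hypothetical helper: longest match wins, ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for name, value in tokenize("{{der|en|ine-pro|*(s)peyk-}}"):
        print(name, repr(value))
```

Running it shows the * coming out as the '*' token between a '|' token and the CHAR tokens for "(s)peyk-". The same reasoning would apply to a # inside running text.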