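I am trying to recognize figure numbers in documents: strings such as "Figure 3. Caption" or "see in Figure 3". Some figures may carry a sub-index ('1A' or '1.1' or '3.A'), and sometimes there is no space between the dot and the caption. I have the following ANTLR file with rules: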
grammar Text;
text : space?
(
keywordFigure
| label
| space
| dot
| comma
| word )*;
FIGURE : 'Figure';
keywordFigure : FIGURE;
LABEL_TOKEN : [0-9]+ [a-zA-Z]?
| [0-9]+ '.' [0-9]+
| [0-9]+ '.' [a-zA-Z];
label : LABEL_TOKEN;
/* Separators */
SPACE_TOKEN : [ \t\n\r]+;
space : SPACE_TOKEN;
DOT : '.';
dot : DOT;
COMMA : ',';
comma : COMMA;
fragment WORD_CHAR : ~[0-9 \t\n\r.,];
WORD_TOKEN : WORD_CHAR+;
word : WORD_TOKEN;
This grammar does not work correctly on some examples:
Figure 6.Regulation
The label is "6.R" but not "6", because of ANTLR's priority rule (the lexer picks the rule that matches the longest input).
See in Figure 2.1, 86Formula and other text
The labels found are "2.1" and "86F".
Is there a way to define boundaries for this rule?
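Free text is harder to parse than a programming language. This is not the first time I have seen replies that discourage using ANTLR, like Mike's comment. Nevertheless, I tried to find a solution.
With your grammar, the lexer rule LABEL_TOKEN systematically consumes the first letter that follows the digits or the digits plus dot, as shown when listing the tokens produced by the lexer: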
% grun Text text -tokens input.txt
[@0,0:5='Figure',<'Figure'>,1:0]
[@1,6:6=' ',<SPACE_TOKEN>,1:6]
[@2,7:9='6.R',<LABEL_TOKEN>,1:7]
[@3,10:18='egulation',<WORD_TOKEN>,1:10]
...
[@21,62:64='86F',<LABEL_TOKEN>,3:19]
[@22,65:70='ormula',<WORD_TOKEN>,3:22]
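The longest-match behavior behind this token list can be reproduced outside ANTLR. The following is only a sketch: the class `LongestMatchDemo` and its `tokenize` method are made-up names, and the lexer rules of the Text grammar are approximated with `java.util.regex` patterns, one per alternative. At each position it keeps the alternative that consumes the most characters and, on a tie, the one listed first, which is the rule the question describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Hypothetical stand-ins for the lexer rules of the Text grammar,
    // one row per alternative: { token name, regex }.
    static final String[][] RULES = {
        {"FIGURE", "Figure"},
        {"LABEL_TOKEN", "[0-9]+[a-zA-Z]?"},
        {"LABEL_TOKEN", "[0-9]+\\.[0-9]+"},
        {"LABEL_TOKEN", "[0-9]+\\.[a-zA-Z]"},
        {"SPACE_TOKEN", "[ \\t\\n\\r]+"},
        {"DOT", "\\."},
        {"COMMA", ","},
        {"WORD_TOKEN", "[^0-9 \\t\\n\\r.,]+"},
    };

    /** Splits the input into tokens, always taking the longest match at each position. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // Strictly longer wins; on equal length the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule[0];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestName + "='" + input.substring(pos, bestEnd) + "'");
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Figure 6.Regulation"));
        System.out.println(tokenize("86Formula"));
    }
}
```

For "6.Regulation" the third LABEL_TOKEN alternative matches three characters while the first matches only one, so the letter R ends up inside the label.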
With this grammar in file Question.g4:
grammar Question;
text
@init {System.out.println("Question last update 2046");}
: line+ EOF ;
line
: ( figure
| ID
| NUMBER
| PUNCTUATION
| SPACE
)+ NL?
;
figure
: 'Figure' SPACE figure_number
;
figure_number
: NUMBER
{ System.out.println("NUMBER ID current tk txt=" + getCurrentToken().getText() + " line=" + getCurrentToken().getLine() + " pos=" + getCurrentToken().getCharPositionInLine()); }
{ getCurrentToken().getText().length() == 1 }? ID
{ System.out.println(" >> C " + $start.getTokenIndex() + "-" + ($ID == null ? "?" : $ID.getTokenIndex()) + " text=" + ($ID == null ? $NUMBER.text : $NUMBER.text + $ID.text)); }
| NUMBER_DOT
{ System.out.println("NUMBER_DOT ID current tk txt=" + getCurrentToken().getText() + " line=" + getCurrentToken().getLine() + " pos=" + getCurrentToken().getCharPositionInLine()); }
{ getCurrentToken().getText().length() == 1 }? ID
{ System.out.println(" >> D " + $start.getTokenIndex() + "-" + ($ID == null ? "?" : $ID.getTokenIndex()) + " text=" + ($ID == null ? $NUMBER_DOT.text : $NUMBER_DOT.text + $ID.text)); }
| NUMBER_DOT NUMBER
{ System.out.println("NUMBER_DOT NUMBER ID current tk txt=" + getCurrentToken().getText() + " line=" + getCurrentToken().getLine() + " pos=" + getCurrentToken().getCharPositionInLine()); }
{ getCurrentToken().getText().length() == 1 }? ID
{ System.out.println(" >> E " + $start.getTokenIndex() + "-" + ($ID == null ? "?" : $ID.getTokenIndex()) + " text=" + ($ID == null ? $NUMBER_DOT.text : $NUMBER_DOT.text + $ID.text)); }
| NUMBER { System.out.println(" >> A " + $start.getTokenIndex() + " text=" + $NUMBER.text); }
| NUMBER_DOT NUMBER? { System.out.println(" >> B " + $start.getTokenIndex() + " text=" + $NUMBER_DOT.text + $NUMBER.text); }
;
ID : LETTER ( LETTER | DIGIT )* ;
NUMBER : DIGIT+ ;
NUMBER_DOT : DIGIT+ '.' ;
PUNCTUATION : [.,;:!?\-()'"’] ;
NL : [\r\n]+ ;
SPACE : ' '+ ;
TAB : [\t]+ -> skip ;
fragment DIGIT : [0-9] ;
fragment LETTER : [a-zA-Z] ;
and the following input in file input.txt:
Figure 6.Regulation
Figure 6 Regulation
Figure 6.1A Regulation
See in Figure 2.1, 86Formula and other text
See Figure 3A xyz.
See Figure 3.1A abc.
Since you’re (apparently ) just trying ... - to identify the structure -, I’d really suggest : a Regex matcher, see Figure 3A.
and given:
alias a4='java -jar /usr/local/lib/antlr-4.11.1-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
export CLASSPATH=.:/usr/local/lib/antlr-4.11.1-complete.jar
execute:
% a4 Question.g4
% javac Q*.java
% grun Question text -tokens -gui input.txt
[@0,0:5='Figure',<'Figure'>,1:0]
[@1,6:6=' ',<SPACE>,1:6]
[@2,7:8='6.',<NUMBER_DOT>,1:7]
[@3,9:18='Regulation',<ID>,1:9]
...
The tree displayed by -gui looks fine, and the output is:
Question last update 2046
NUMBER_DOT ID current tk txt=Regulation line=1 pos=9
line 1:9 rule figure_number failed predicate: { getCurrentToken().getText().length() == 1 }?
>> A 7 text=6
NUMBER_DOT NUMBER ID current tk txt=A line=3 pos=10
>> E 13-15 text=6.A
>> B 25 text=2.1
NUMBER ID current tk txt=A line=5 pos=12
>> C 42-43 text=3A
NUMBER_DOT NUMBER ID current tk txt=A line=6 pos=14
>> E 52-54 text=3.A
NUMBER ID current tk txt=A line=8 pos=124
>> C 112-113 text=3A
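For the input Figure 6.Regulation, the semantic predicate
{ getCurrentToken().getText().length() == 1 }?
rejects the NUMBER_DOT ID alternative of rule figure_number, because the ID that follows the number (Regulation) is longer than one character.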
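To give alternatives C, D and E a chance to be used, I had to move A and B to the end of rule figure_number. The explanation is in section 15.7 of the book (page 286, PDF page 288):
"ANTLR's general decision-making strategy is to find all viable alternatives and then ignore the alternatives guarded with predicates that currently evaluate to false. (A viable alternative is one that matches the current input.) If more than one viable alternative remains, the parser chooses the alternative specified first in the decision."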
figure_number
: NUMBER ... >> A
| NUMBER_DOT NUMBER? ... >> B
With A and B listed first, as in the fragment above, the parsing decisions give a completely different result:
Question last update 0853
>> B 2 text=6.null
>> A 7 text=6
>> B 13 text=6.1
>> B 25 text=2.1
>> A 42 text=3
>> B 52 text=3.1
>> A 112 text=3
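The effect of alternative order in the two runs can be illustrated with a toy model. This is a deliberately simplified sketch, not ANTLR's actual prediction algorithm: the class `AlternativeOrderDemo` and its `choose` method are hypothetical, an alternative counts as viable when its token types are a prefix of the upcoming token types, and the first viable alternative wins. The optional NUMBER in alternative B is modeled as two rows, and the semantic predicates are not modeled at all.

```java
import java.util.Arrays;
import java.util.List;

class AlternativeOrderDemo {
    // Each row: alternative name, then the token types it expects in order.
    // Order used in Question.g4: C, D, E before A and B.
    static final String[][] PREDICATED_FIRST = {
        {"C", "NUMBER", "ID"},
        {"D", "NUMBER_DOT", "ID"},
        {"E", "NUMBER_DOT", "NUMBER", "ID"},
        {"A", "NUMBER"},
        {"B", "NUMBER_DOT", "NUMBER"},
        {"B", "NUMBER_DOT"},
    };

    // Same alternatives with A and B moved to the front.
    static final String[][] PLAIN_FIRST = {
        {"A", "NUMBER"},
        {"B", "NUMBER_DOT", "NUMBER"},
        {"B", "NUMBER_DOT"},
        {"C", "NUMBER", "ID"},
        {"D", "NUMBER_DOT", "ID"},
        {"E", "NUMBER_DOT", "NUMBER", "ID"},
    };

    /** Returns the first alternative whose expected types are a prefix of tokenTypes. */
    static String choose(String[][] alternatives, List<String> tokenTypes) {
        for (String[] alt : alternatives) {
            List<String> expected = Arrays.asList(alt).subList(1, alt.length);
            if (tokenTypes.size() >= expected.size()
                    && tokenTypes.subList(0, expected.size()).equals(expected)) {
                return alt[0];
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        List<String> figure3A = Arrays.asList("NUMBER", "ID", "SPACE");            // 3A xyz
        List<String> figure61A = Arrays.asList("NUMBER_DOT", "NUMBER", "ID");      // 6.1A
        System.out.println(choose(PREDICATED_FIRST, figure3A) + " vs " + choose(PLAIN_FIRST, figure3A));
        System.out.println(choose(PREDICATED_FIRST, figure61A) + " vs " + choose(PLAIN_FIRST, figure61A));
    }
}
```

In this model, as in the two outputs above, "3A" goes to C when C comes first and to A when A comes first, and "6.1A" goes to E or to B in the same way. For "6.Regulation" the model picks D under the first ordering; in the real parser that is the alternative whose predicate then fails.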
For completeness, one can use a listener, file MyListener.java:
public class MyListener extends QuestionBaseListener {
QuestionParser parser;
public MyListener(QuestionParser parser) { this.parser = parser; }
public void exitLine(QuestionParser.LineContext ctx) {
System.out.println(">>> in MyListener for line");
System.out.println(parser.getTokenStream().getText(ctx));
}
}
and an application, file Job.java:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;
public class Job {
public static void main(String[] args) throws Exception {
System.out.println("==========Java Job 4/1/2023 14:32");
String input_file_name = "";
if ( args.length < 1 )
throw new Exception("Missing input file name on command line");
input_file_name = args[0];
CharStream input = CharStreams.fromFileName(input_file_name);
System.out.println("==========Java Job start parsing ==========");
QuestionLexer lexer = new QuestionLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
QuestionParser parser = new QuestionParser(tokens);
ParseTree tree = parser.text();
System.out.println("==========Java Job parsing ended");
ParseTreeWalker walker = new ParseTreeWalker();
System.out.println("==========Java Job about to new MyListener");
MyListener listener = new MyListener(parser);
System.out.println("==========Java Job about to walk");
walker.walk(listener, tree);
System.out.println("==========Java Job tree size=" + tree.getChildCount());
}
}
which prints the lines as seen by the parser:
% java Job input.txt
...
==========Java Job about to walk
>>> in MyListener for line
Figure 6.Regulation
...
>>> in MyListener for line
See in Figure 2.1, 86Formula and other text
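As a cross-check on the labels the parser should report, the regex route mentioned in the last line of input.txt can be sketched in a few lines. This is not the grammar-based approach above but a swapped-in technique, and the pattern, the class `FigureLabelRegex` and the method `labels` are all assumptions made for illustration. The negative lookahead plays the role of the length() == 1 predicate: a letter is accepted as a sub-index only when no further letter or digit follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FigureLabelRegex {
    // Hypothetical pattern: "Figure", whitespace, digits, optional ".digits",
    // then an optional single-letter sub-index (with or without a dot in front)
    // that must not be followed by another letter or digit.
    static final Pattern FIGURE_LABEL = Pattern.compile(
        "Figure\\s+(\\d+(?:\\.\\d+)?(?:\\.?[A-Za-z](?![A-Za-z0-9]))?)");

    /** Returns every figure label found in the text, in order of appearance. */
    static List<String> labels(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = FIGURE_LABEL.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Figure 6.Regulation",
            "Figure 6 Regulation",
            "Figure 6.1A Regulation",
            "See in Figure 2.1, 86Formula and other text",
            "See Figure 3A xyz.",
            "See Figure 3.1A abc.",
        };
        for (String line : lines) {
            System.out.println(line + " -> " + labels(line));
        }
    }
}
```

With this pattern "Figure 6.Regulation" yields 6 rather than 6.R, and 86Formula is ignored because it is not preceded by the keyword Figure.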