ANTLR 4 grammar gives extraneous input errors



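I am trying to create a (I thought) simple grammar for processing a file that contains a list of key/value assignments, one assignment per line.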

I used ANTLR in the past (mid-1990s) and decided to use it again because I want to allow comments in the assignment file, as well as Unicode keys and values.

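My simple test file proves once again that writing a correct grammar is hard, even with good tools. I am using the ANTLR language support plug-in for VS 2012 and developing in C#. I had reservations about the tooling being Eclipse/Java oriented, but the C# plug-in and the ANTLR NuGet packages (runtime and code generator) work exactly as advertised.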

My grammar file is:

grammar AssignmentListFile;
/*
 * See: http://en.wikipedia.org/wiki/List_of_Unicode_characters
 * for list of Unicode Code Points
 */

/*
 * Lexer Rules: Must be in all UPPER case
 * Parser Rules: Must be in all lower case
 */
// Ignore All non-printable control characters except: CR, LF and SPACE
IGNORED_WHITESPACE : 
       (
         '\u0000' .. '\u0009'  // 7-bit control chars less than Line Feed
       | '\u000B'  | '\u000C'  // Vertical tab and Form feed
       | '\u000E' .. '\u001F'  // 7-bit control chars more than Carriage Return
       | '\u007F' .. '\u009F'  // 8-bit ASCII control characters and DEL
       )+
     -> channel(HIDDEN)
     ;
// Ignore Comments and any ending white spaces
JAVADOC_COMMENT  
  : '/**' .*? '*/' [ \r\n]*
  -> channel(HIDDEN)
  ;
CSTYLE_COMMENT  
  : '/*'  .*? '*/'  [ \r\n]*
  -> channel(HIDDEN)
  ;
/*
 * Manage the assignment delimiter and 
 * the 3 white space characters which have not been ignored: SPACE, CR, and LF
 */
fragment SINGLE_SPACE : ' ';
EQUALS : '=';
EOL : SINGLE_SPACE* [\r\n]+ SINGLE_SPACE* ;
ASSIGNMENT_OPERATOR :  SINGLE_SPACE* EQUALS SINGLE_SPACE* ;
// define the various forms of single and double quotes for the dumb, open, and close variants 
                     //   ASCII    Open/Left  Close/Right
CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;
/*
 * create the character sets that can be part of an ID
 */
fragment IDCHAR_COMMON : 
         ( '\u0020'  | '\u0021'  // Space and bang (!)
         | '\u0023' .. '\u0026'  // # to & (skips ")
         | '\u0028' .. '\u003C'  // ( to < (skips ')
         | '\u003E' .. '\u007E'  // > to ~ (skips =)
         | '\u00A0' .. '\u2018'  // printable UNICODE code points below  Open Single Quote
         | '\u201A' .. '\u201B'  // printable UNICODE code points between Close Single Quote and Open Double Quote
         | '\u201E' .. '\uFFFF'  // printable UNICODE code points above Close Double Quote
         )
       ;

// define the characters that can be contained in each of the quoted identifier types
NON_QUOTED_VALUE : IDCHAR_COMMON+;
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_SINGLEQUOTE | EQUALS)+
          ;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_DOUBLEQUOTE | EQUALS)+
          ;
file : file_line* EOF ;
file_line 
  : assignment
  | EOL
  ;
assignment
  : identifier  ASSIGNMENT_OPERATOR  identifier 
  ;
identifier 
    : NON_QUOTED_VALUE 
    | CHAR_DOUBLEQUOTE DOUBLE_QUOTED_VALUE CHAR_DOUBLEQUOTE 
    | CHAR_SINGLEQUOTE SINGLE_QUOTED_VALUE CHAR_SINGLEQUOTE
    ;

My input file is:

/*
 * This is a Multiline C-Style comment
 * with white space here:   
 */
/* this is a single line C-Style comment  */
/* this is a single line C-Style comment /w whitepace */
/*      
  */
/**/
/**
 * this is a Multiline JavaDoc comment
 * with white space here:    
 */
/** this is a single line JavaDoc comment */
/**     
  */
  /***/     
JOHN=WASHBURN
 JOHN = WASHBURN 
'JOHN'='WASHBURN'
"JOHN" = "WASHBURN"

The C# code that invokes the lexer/parser is:

  var input = new AntlrInputStream(textStream.ReadToEnd());
  var lexer = new AssignmentListFileLexer(input);
  var tokens = new CommonTokenStream(lexer);
  var parser = new AssignmentListFileParser(tokens);
  Console.WriteLine("\n");
  IParseTree tree = parser.file();
  Console.WriteLine(tree.ToStringTree(parser));
  Console.WriteLine("\n");

When this C# code is run against the test file, the output under NUnit is:

line 23:0 extraneous input 'JOHN=WASHBURN' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 24:1 extraneous input 'JOHN = WASHBURN ' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 25:0 extraneous input ''JOHN'='WASHBURN'' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 26:0 extraneous input '"JOHN" = "WASHBURN"' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
(file JOHN=WASHBURN (file_line \r\n ) JOHN = WASHBURN  (file_line \r\n) 'JOHN'='WASHBURN' (file_line \r\n) "JOHN" = "WASHBURN" <EOF>)

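First, as you can see, I have not even started testing the interesting cases (for example, German names/values, quoted IDs containing = signs or other quote characters, and so on). The parts of the test file that are only ignorable whitespace and/or comments parse as expected, and the printed tree shows that the end-of-line (EOL) logic seems to be on track. However, the recognition errors occur when parsing the assignment expressions themselves.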

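I am puzzled as to how the 4-character phrase JOHN (or the phrase WASHBURN) fails to match NON_QUOTED_VALUE, or how the quoted form of JOHN fails to match a quote token such as CHAR_SINGLEQUOTE. Or how "=" or " = " fails to match in the assignment rule.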

I am sure this will be a DOH! moment, but what am I missing?

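The reason the 4-character phrase JOHN is not recognized as a NON_QUOTED_VALUE token is that the whole of JOHN=WASHBURN is recognized as a single DOUBLE_QUOTED_VALUE token. Instrumenting the grammar with the following tracing shows this (sorry, Java code, but I am sure you can translate).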

NON_QUOTED_VALUE : IDCHAR_COMMON+  {System.out.println("#A:"+getText());};
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_SINGLEQUOTE | EQUALS)+ {System.out.println("#B:"+getText());}
          ;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE 
          | (IDCHAR_COMMON |  CHAR_DOUBLEQUOTE | EQUALS)+ {System.out.println("#C:"+getText());}
          ;

This produces the following output:

#B:JOHN=WASHBURN
#B:JOHN = WASHBURN 
#B:'JOHN'='WASHBURN'
#C:"JOHN" = "WASHBURN"

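The reason is that the lexer rule that recognizes the longest match takes priority. When two rules match the same length of input, the rule defined first in the grammar wins, which is why JOHN=WASHBURN is reported as #B rather than #C.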
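To make the longest-match behaviour concrete, here is a minimal sketch in plain Java that imitates how a lexer picks a token at the start of a line. It does not use ANTLR at all. The class name MaximalMunchDemo and the method firstToken are made up for illustration, the rules are regular-expression approximations of the lexer rules from the question, and IDCHAR_COMMON is simplified to its ASCII ranges only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not generated by or dependent on ANTLR.
public class MaximalMunchDemo {
    // ASSUMPTION: ASCII-only simplification of IDCHAR_COMMON
    // (space, !, # to &, ( to <, > to ~); the Unicode ranges are omitted.
    static final String ID =
        "\\u0020\\u0021\\u0023-\\u0026\\u0028-\\u003C\\u003E-\\u007E";

    // Insertion order mirrors the order of the lexer rules in the question.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("EQUALS", Pattern.compile("="));
        RULES.put("ASSIGNMENT_OPERATOR", Pattern.compile(" *= *"));
        RULES.put("CHAR_SINGLEQUOTE", Pattern.compile("'"));
        RULES.put("CHAR_DOUBLEQUOTE", Pattern.compile("\""));
        RULES.put("NON_QUOTED_VALUE", Pattern.compile("[" + ID + "]+"));
        RULES.put("DOUBLE_QUOTED_VALUE", Pattern.compile("[" + ID + "'=]+"));
        RULES.put("SINGLE_QUOTED_VALUE", Pattern.compile("[" + ID + "\"=]+"));
    }

    // Returns "RULE:text" for the rule that wins at the start of the input:
    // the longest match wins, and on a tie the earlier rule is kept.
    static String firstToken(String input) {
        String bestRule = "<no match>";
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input only.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("JOHN"));
        System.out.println(firstToken("JOHN=WASHBURN"));
        System.out.println(firstToken("'JOHN'='WASHBURN'"));
        System.out.println(firstToken("\"JOHN\" = \"WASHBURN\""));
    }
}
```

Under these assumptions, JOHN on its own comes out as NON_QUOTED_VALUE, but each full assignment line is swallowed whole by DOUBLE_QUOTED_VALUE or SINGLE_QUOTED_VALUE, which agrees with the #B and #C trace lines above. The parser therefore never sees separate identifier and operator tokens.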

If it helps, the following grammar should recognize your sample file.

CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;
EQUALS : '=';
EOL : [\r\n]+ ;
IGNORED_WHITESPACE : 
       ( ' '
       | '\u0000' .. '\u0009'  // 7-bit control chars less than Line Feed
       | '\u000B'  | '\u000C'  // Vertical tab and Form feed
       | '\u000E' .. '\u001F'  // 7-bit control chars more than Carriage Return
       | '\u007F' .. '\u009F'  // 8-bit ASCII control characters and DEL
       )+
     -> channel(HIDDEN)
     ;
IDCHAR_COMMON : 
         ( '\u0020'  | '\u0021'  // Space and bang (!)
         | '\u0023' .. '\u0026'  // # to & (skips ")
         | '\u0028' .. '\u003C'  // ( to < (skips ')
         | '\u003E' .. '\u007E'  // > to ~ (skips =)
         | '\u00A0' .. '\u2018'  // printable UNICODE code points below  Open Single Quote
         | '\u201A' .. '\u201B'  // printable UNICODE code points between Close Single Quote and Open Double Quote
         | '\u201E' .. '\uFFFF'  // printable UNICODE code points above Close Double Quote
         )
       ;
NON_QUOTED_VALUE : IDCHAR_COMMON+  {System.out.println("#A:"+getText());};
JAVADOC_COMMENT  
  : '/**' .*? '*/' [ \r\n]*
  -> channel(HIDDEN)
  ;
CSTYLE_COMMENT  
  : '/*'  .*? '*/'  [ \r\n]*
  -> channel(HIDDEN)
  ;

file : file_line* EOF ;
file_line 
  : assignment
  | EOL
  ;
assignment
  : identifier  EQUALS  identifier 
  ;
identifier : NON_QUOTED_VALUE 
           | CHAR_DOUBLEQUOTE (NON_QUOTED_VALUE |  CHAR_SINGLEQUOTE | EQUALS)+ CHAR_DOUBLEQUOTE 
           | CHAR_SINGLEQUOTE (NON_QUOTED_VALUE |  CHAR_DOUBLEQUOTE | EQUALS)+ CHAR_SINGLEQUOTE ;

It should also parse the following, which I believe is valid input from reading your grammar.

'JO"HN'='WASHBURN'
"JO='HN" = "WASHBURN"
