ANTLR4 parser issue



I'm trying to write a parser for C++-style header files, but I can't get the parser configured correctly.

Lexer:

lexer grammar HeaderLexer;
SectionLineComment
:   LINE_COMMENT_SIGN Section CharacterSequence
;
Pragma
: POUND 'pragma'
;
Section
:  AT_SIGN 'section'
;
Define
: POUND 'define'
| LINE_COMMENT_SIGN POUND 'define'
;
Booleanliteral
: False
| True
;
QuotedCharacterSequence
:   '"' .*?  '"'
;
ArraySequence
:   '{' .*?  '}'
|   '[' .*?  ']'
;
IntNumber
:   Digit+
;
DoubleNumber
:   Digit+ POINT Digit+
|   ZERO POINT Digit+
;
CharacterSequence
:   Text+
;
Identifier
:   [a-zA-Z_0-9]+
;
BlockComment
: '/**' .*? '*/'
;
LineComment
:   LINE_COMMENT_SIGN ~[\r\n]*
;
EmptyLineComment
:   LINE_COMMENT_SIGN -> skip
;
Newline
:   (   '\r' '\n'?
|   '\n'
)
-> skip
;
WhiteSpace
: [ \r\n\t]+ -> skip;
fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';
fragment Digit
:   [0-9]
;
fragment Text
:   [a-zA-Z0-9.]
;

fragment False
: 'false'
;
fragment True
: 'true'
;

Parser:

parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit: statement* EOF;
statement
: comment? pragmaDirective
| comment? defineDirective
| section
| comment
;
pragmaDirective
:   Pragma CharacterSequence
;
defineDirective
:   Define Identifier Booleanliteral LineComment?
|   Define Identifier DoubleNumber LineComment?
|   Define Identifier IntNumber LineComment?
|   Define Identifier CharacterSequence LineComment?
|   Define Identifier QuotedCharacterSequence LineComment?
|   Define Identifier ArraySequence LineComment?
|   Define Identifier
;
section: SectionLineComment;
comment
: BlockComment
| LineComment+
;

Text to parse:

/**
* BLOCK COMMENT
*/
#pragma once
/**
* BLOCK COMMENT
*/
#define CONFIGURATION_H_VERSION 12345
#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd
#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0
//================================================================
//============================= INFO =============================
//================================================================
/**
* SEPARATE BLOCK COMMENT
*/
//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//
// @section test
// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5
// Line 6
#define IDENTIFIER_THREE

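With this configuration I have several problems:

  1. The parser cannot correctly parse "#define IDENTIFIER abcd" on line 11
  2. "// @section test" on line 36 is parsed as a line comment, but I need it to be parsed as a separate token
  3. Parsing of a commented-out define directive does not work: "//#define IDENTIFIER_3 Version.h // Line 5"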

Whenever something goes wrong during parsing, you should check which tokens the lexer is producing.

Here are the tokens the lexer produces:

BlockComment              `/**\n * BLOCK COMMENT\n */`
Pragma                    `#pragma`
CharacterSequence         `once`
BlockComment              `/**\n * BLOCK COMMENT\n */`
Define                    `#define`
Identifier                `CONFIGURATION_H_VERSION`
IntNumber                 `12345`
Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd.dd`
Define                    `#define`
Identifier                `IDENTIFIER_2`
Booleanliteral            `true`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20`
ArraySequence             `{ONE, TWO}`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20_30`
ArraySequence             `{ 1, 2, 3, 4 }`
Define                    `#define`
Identifier                `IDENTIFIER_20_30_A`
ArraySequence             `[ 1, 2, 3, 4 ]`
Define                    `#define`
Identifier                `DEFAULT_A`
DoubleNumber              `10.0`
LineComment               `//================================================================`
LineComment               `//============================= INFO =============================`
LineComment               `//================================================================`
BlockComment              `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment               `//==================================================================`
LineComment               `//============================= INFO ===============================`
LineComment               `//==================================================================`
LineComment               `// Line 1`
LineComment               `// Line 2`
LineComment               `//`
LineComment               `// @section test`
LineComment               `// Line 3`
Define                    `#define`
Identifier                `IDENTIFIER_TWO`
QuotedCharacterSequence   `"(ONE, TWO, THREE)"`
LineComment               `// Line 4`
LineComment               `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment               `// Line 6`
Define                    `#define`
Identifier                `IDENTIFIER_THREE`

As the list above shows, #define IDENTIFIER abcd is not parsed correctly because it produces the following tokens:

Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`

and therefore cannot be matched by the parser rule:

defineDirective
:   ...
|   Define Identifier CharacterSequence LineComment?
|   ...
;

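As you can see, the lexer operates independently of the parser. Even though the parser is trying to match an Identifier for the text "IDENTIFIER", the lexer will always produce a CharacterSequence token for it.

The lexer creates tokens based on only 2 rules:

  1. Try to match as many characters as possible
  2. If 2 (or more) lexer rules can match the same characters, the rule defined first "wins"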

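Because of these rules, //#define IDENTIFIER_3 Version.h // Line 5 is tokenized as a LineComment (rule 1 applies: match as much as possible). And input like once is tokenized as a CharacterSequence rather than an Identifier (rule 2 applies: CharacterSequence is defined before Identifier).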
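The two rules can be reproduced outside ANTLR with a small Python sketch. This is only a simulation under stated assumptions: the regexes below are simplified stand-ins for three of the original lexer rules, and `next_token` is a hypothetical helper, not part of any ANTLR API.

```python
import re

# Simplified stand-ins for three of the original lexer rules, listed in the
# same order as they are defined in the grammar (assumption: regexes
# approximate the ANTLR character sets).
RULES = [
    ("IntNumber", re.compile(r"[0-9]+")),
    ("CharacterSequence", re.compile(r"[a-zA-Z0-9.]+")),
    ("Identifier", re.compile(r"[a-zA-Z_0-9]+")),
]

def next_token(text, pos=0):
    """Return (rule name, matched text) for the token starting at pos."""
    best = None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Rule 1: only a strictly longer match replaces the current best.
        # Rule 2: on a tie, the rule defined first therefore stays the winner.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best

for word in ["IDENTIFIER", "IDENTIFIER_1", "abcd.dd", "12345"]:
    print(word, "->", next_token(word)[0])
```

IDENTIFIER is matched equally well by CharacterSequence and Identifier, so the earlier rule wins, while IDENTIFIER_1 contains an underscore that only Identifier can consume, so the longer match wins.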

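To treat #define the same inside and outside a comment, you could use lexical modes. Whenever the lexer sees a //, it enters a special comment mode, and while in this comment mode it also recognizes the #define and @section tokens. The lexer leaves this mode when it sees one of those tokens (or, of course, when it sees a line break).

A quick demo of how that could look: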

lexer grammar HeaderLexer;
SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
NUMBER          : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}' | '[' .*?  ']';
// Catch-all rule kept last so it cannot win a length tie against NUMBER
OTHER           : .;
mode COMMENT_MODE;
// If we match one of the following 3 rules, leave this comment mode
COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
// If none of the 3 rules above matched, consume a single
// character (which is part of the comment)
COMMENT_MODE_PART       : ~[\r\n];
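To get a feel for how the mode switching in the lexer above plays out, the mode stack can be imitated in plain Python. This is a rough sketch, not ANTLR itself: the rule tables cover only part of the demo grammar, the regexes are approximations, and `tokenize` plus the "push"/"pop"/"skip" action names are hypothetical stand-ins for pushMode, popMode and skip.

```python
import re

# Hypothetical rule tables: (token type, regex, actions). They mirror only
# part of the demo grammar, in the same relative order.
DEFAULT_MODE = [
    ("SPACES", r"[ \r\n\t]+", {"skip"}),
    ("COMMENT_START", r"//", {"push"}),
    ("DEFINE", r"#define", set()),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*", set()),
    ("CHAR_SEQUENCE", r"[a-zA-Z_][a-zA-Z_0-9.]*", set()),
    ("OTHER", r".", set()),
]
COMMENT_MODE = [
    ("DEFINE", r"#define", {"pop"}),
    ("SECTION", r"@section", {"pop"}),
    ("LINE_BREAK", r"[\r\n]+", {"skip", "pop"}),
    ("COMMENT_MODE_PART", r"[^\r\n]", set()),
]

def tokenize(text):
    """Return (type, text) pairs, switching rule tables via a mode stack."""
    modes = [DEFAULT_MODE]
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern, actions in modes[-1]:
            match = re.compile(pattern).match(text, pos)
            # Longest match wins; a tie keeps the rule listed first.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match, actions)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, match, actions = best
        if "skip" not in actions:
            tokens.append((name, match.group()))
        if "push" in actions:
            modes.append(COMMENT_MODE)   # entering a // comment
        if "pop" in actions:
            modes.pop()                  # back to the previous mode
        pos = match.end()
    return tokens

for line in ["// @section test\n", "//#define IDENTIFIER_3 Version.h\n"]:
    print([kind for kind, _ in tokenize(line)])
```

In both sample lines the comment mode is left as soon as @section or #define is seen, so the rest of the line is tokenized by the default-mode rules again.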

The parser could then look like this:

parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| sectionLineComment
| comment
;
pragmaDirective
:   PRAGMA char_sequence
;
defineDirective
: DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
| DEFINE IDENTIFIER NUMBER line_comment?
| DEFINE IDENTIFIER char_sequence line_comment?
| DEFINE IDENTIFIER STRING line_comment?
| DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
| DEFINE IDENTIFIER
;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
