I'm trying to write a parser for C++-style header files, but I can't get the parser configured correctly.
Lexer:
lexer grammar HeaderLexer;
SectionLineComment
: LINE_COMMENT_SIGN Section CharacterSequence
;
Pragma
: POUND 'pragma'
;
Section
: AT_SIGN 'section'
;
Define
: POUND 'define'
| LINE_COMMENT_SIGN POUND 'define'
;
Booleanliteral
: False
| True
;
QuotedCharacterSequence
: '"' .*? '"'
;
ArraySequence
: '{' .*? '}'
| '[' .*? ']'
;
IntNumber
: Digit+
;
DoubleNumber
: Digit+ POINT Digit+
| ZERO POINT Digit+
;
CharacterSequence
: Text+
;
Identifier
: [a-zA-Z_0-9]+
;
BlockComment
: '/**' .*? '*/'
;
LineComment
: LINE_COMMENT_SIGN ~[\r\n]*
;
EmptyLineComment
: LINE_COMMENT_SIGN -> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
WhiteSpace
: [ \r\n\t]+ -> skip;
fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';
fragment Digit
: [0-9]
;
fragment Text
: [a-zA-Z0-9.]
;
fragment False
: 'false'
;
fragment True
: 'true'
;
Parser:
parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit: statement* EOF;
statement
: comment? pragmaDirective
| comment? defineDirective
| section
| comment
;
pragmaDirective
: Pragma CharacterSequence
;
defineDirective
: Define Identifier Booleanliteral LineComment?
| Define Identifier DoubleNumber LineComment?
| Define Identifier IntNumber LineComment?
| Define Identifier CharacterSequence LineComment?
| Define Identifier QuotedCharacterSequence LineComment?
| Define Identifier ArraySequence LineComment?
| Define Identifier
;
section: SectionLineComment;
comment
: BlockComment
| LineComment+
;
Text to parse:
/**
* BLOCK COMMENT
*/
#pragma once
/**
* BLOCK COMMENT
*/
#define CONFIGURATION_H_VERSION 12345
#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd
#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30 { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0
//================================================================
//============================= INFO =============================
//================================================================
/**
* SEPARATE BLOCK COMMENT
*/
//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//
// @section test
// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5
// Line 6
#define IDENTIFIER_THREE
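With this configuration, I have several problems:
- The parser fails to parse `#define IDENTIFIER abcd` on line 11 correctly
- `// @section test` on line 36 is parsed as a line comment, but I need it parsed as a separate token
- Parsing of commented-out define directives does not work: `//#define IDENTIFIER_3 Version.h // Line 5`
Whenever something goes wrong while parsing, you should check which tokens the lexer is producing.
Here are the tokens the lexer generates: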
BlockComment `/**\n * BLOCK COMMENT\n */`
Pragma `#pragma`
CharacterSequence `once`
BlockComment `/**\n * BLOCK COMMENT\n */`
Define `#define`
Identifier `CONFIGURATION_H_VERSION`
IntNumber `12345`
Define `#define`
CharacterSequence `IDENTIFIER`
CharacterSequence `abcd`
Define `#define`
Identifier `IDENTIFIER_1`
CharacterSequence `abcd`
Define `#define`
Identifier `IDENTIFIER_1`
CharacterSequence `abcd.dd`
Define `#define`
Identifier `IDENTIFIER_2`
Booleanliteral `true`
LineComment `// Line`
Define `#define`
Identifier `IDENTIFIER_20`
ArraySequence `{ONE, TWO}`
LineComment `// Line`
Define `#define`
Identifier `IDENTIFIER_20_30`
ArraySequence `{ 1, 2, 3, 4 }`
Define `#define`
Identifier `IDENTIFIER_20_30_A`
ArraySequence `[ 1, 2, 3, 4 ]`
Define `#define`
Identifier `DEFAULT_A`
DoubleNumber `10.0`
LineComment `//================================================================`
LineComment `//============================= INFO =============================`
LineComment `//================================================================`
BlockComment `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment `//==================================================================`
LineComment `//============================= INFO ===============================`
LineComment `//==================================================================`
LineComment `// Line 1`
LineComment `// Line 2`
LineComment `//`
LineComment `// @section test`
LineComment `// Line 3`
Define `#define`
Identifier `IDENTIFIER_TWO`
QuotedCharacterSequence `"(ONE, TWO, THREE)"`
LineComment `// Line 4`
LineComment `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment `// Line 6`
Define `#define`
Identifier `IDENTIFIER_THREE`
As the list above shows, `#define IDENTIFIER abcd` is not parsed correctly because it produces the following tokens:
Define `#define`
CharacterSequence `IDENTIFIER`
CharacterSequence `abcd`
and that sequence cannot be matched by the parser rule:
defineDirective
: ...
| Define Identifier CharacterSequence LineComment?
| ...
;
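As you can see, the lexer operates independently of the parser. It does not matter whether the parser is trying to match an `Identifier` for the text `IDENTIFIER`: the lexer will produce a `CharacterSequence` token for it.
The lexer creates tokens based on just 2 rules:
- try to match as many characters as possible
- if 2 (or more) lexer rules can match the same characters, the rule defined first "wins"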
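Because of these rules, `//#define IDENTIFIER_3 Version.h // Line 5` is tokenized as a `LineComment` (rule 1 applies: match as much as possible), and input such as `once` is tokenized as a `CharacterSequence` rather than an `Identifier` (rule 2 applies: `CharacterSequence` is defined before `Identifier`).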
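The two rules can be reproduced outside ANTLR to see why the tokens come out the way they do. The following is only a rough Python sketch, not ANTLR itself: `RULES` is a hand-written regex approximation of a subset of the lexer rules from the question (in the same order), and `tokenize` is a hypothetical helper name.

```python
import re

# Regex stand-ins for a subset of the question's lexer rules, listed in the
# same order as in the grammar. This is an approximation for illustration.
RULES = [
    ("Define", r"#define|//#define"),
    ("Booleanliteral", r"false|true"),
    ("IntNumber", r"[0-9]+"),
    ("DoubleNumber", r"[0-9]+\.[0-9]+"),
    ("CharacterSequence", r"[a-zA-Z0-9.]+"),
    ("Identifier", r"[a-zA-Z_0-9]+"),
    ("LineComment", r"//[^\r\n]*"),
    ("WhiteSpace", r"[ \t\r\n]+"),  # skipped, like "-> skip"
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Tokenize using longest match first, then first-defined rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Rule 1: a strictly longer match replaces the current best.
            # Rule 2: on a tie the earlier rule is kept, because ">" is strict.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WhiteSpace":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for line in ("#define IDENTIFIER abcd",
             "#define IDENTIFIER_1 abcd",
             "//#define IDENTIFIER_3 Version.h // Line 5"):
    print(tokenize(line))
```

In this sketch `IDENTIFIER` ties between `CharacterSequence` and `Identifier`, so the earlier rule wins, while `IDENTIFIER_1` can only be matched in full by `Identifier` because of the underscore. The commented-out define is swallowed whole by `LineComment`, since that match is longer than `//#define`.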
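To treat `#define` the same way inside and outside of comments, you can use lexical modes. Whenever the lexer sees a `//`, it enters a special comment mode, and while in that mode it also recognizes `#define` and `@section` tokens. When it sees one of these tokens (or a line break, of course), it leaves the mode again.
A quick demo of what that could look like: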
lexer grammar HeaderLexer;
SPACES : [ \r\n\t]+ -> skip;
COMMENT_START : '//' -> pushMode(COMMENT_MODE);
PRAGMA : '#pragma';
SECTION : '@section';
DEFINE : '#define';
BOOLEAN_LITERAL : 'true' | 'false';
STRING : '"' .*? '"';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT : '/**' .*? '*/';
NUMBER : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE : '{' .*? '}' | '[' .*? ']';
// Catch-all rule: keep it last so it cannot shadow a single-digit NUMBER
OTHER : .;
mode COMMENT_MODE;
// If we match one of the following 3 rules, leave this comment mode
COMMENT_MODE_DEFINE : '#define' -> type(DEFINE), popMode;
COMMENT_MODE_SECTION : '@section' -> type(SECTION), popMode;
COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;
// If none of the 3 rules above matched, consume a single
// character (which is part of the comment)
COMMENT_MODE_PART : ~[\r\n];
The parser could then look like this:
parser grammar HeaderParser;
options { tokenVocab=HeaderLexer; }
compilationUnit
: statement* EOF
;
statement
: comment? pragmaDirective
| comment? defineDirective
| sectionLineComment
| comment
;
pragmaDirective
: PRAGMA char_sequence
;
defineDirective
: DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
| DEFINE IDENTIFIER NUMBER line_comment?
| DEFINE IDENTIFIER char_sequence line_comment?
| DEFINE IDENTIFIER STRING line_comment?
| DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
| DEFINE IDENTIFIER
;
sectionLineComment
: COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
;
comment
: BLOCK_COMMENT
| line_comment
;
line_comment
: COMMENT_START COMMENT_MODE_PART*
;
char_sequence
: CHAR_SEQUENCE
| IDENTIFIER
;
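To get a feel for the token stream the parser receives from the mode-based lexer, the mode switching can be approximated in plain Python. This is only a sketch under assumptions: `MODES` is a hand-written regex approximation of part of the lexer grammar above (strings, block comments and arrays are left out), `tokenize_with_modes` is a hypothetical helper name, and the `"push"`, `"pop"` and `"skip"` strings stand in for ANTLR's `pushMode`, `popMode` and `skip` commands.

```python
import re

# (token type, pattern, actions) per mode, in grammar order.
MODES = {
    "DEFAULT": [
        ("SPACES", r"[ \t\r\n]+", {"skip"}),
        ("COMMENT_START", r"//", {"push"}),
        ("PRAGMA", r"#pragma", set()),
        ("SECTION", r"@section", set()),
        ("DEFINE", r"#define", set()),
        ("BOOLEAN_LITERAL", r"true|false", set()),
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*", set()),
        ("NUMBER", r"[0-9]+(\.[0-9]+)?", set()),
        ("CHAR_SEQUENCE", r"[a-zA-Z_][a-zA-Z_0-9.]*", set()),
    ],
    "COMMENT": [
        # These two reuse the DEFINE / SECTION token types, like type(...).
        ("DEFINE", r"#define", {"pop"}),
        ("SECTION", r"@section", {"pop"}),
        ("LINE_BREAK", r"[\r\n]+", {"skip", "pop"}),
        ("COMMENT_MODE_PART", r"[^\r\n]", set()),
    ],
}
COMPILED = {
    mode: [(name, re.compile(pattern), actions) for name, pattern, actions in rules]
    for mode, rules in MODES.items()
}


def tokenize_with_modes(text):
    """Tokenize text with a mode stack; only the top mode's rules are active."""
    stack = ["DEFAULT"]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, regex, actions in COMPILED[stack[-1]]:
            match = regex.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if match and (best is None or len(match.group()) > len(best[1])):
                best = (name, match.group(), actions)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, lexeme, actions = best
        pos += len(lexeme)
        if "push" in actions:
            stack.append("COMMENT")
        if "pop" in actions:
            stack.pop()
        if "skip" not in actions:
            tokens.append((name, lexeme))
    return tokens


for token in tokenize_with_modes("//#define IDENTIFIER_3 Version.h // Line 5\n"):
    print(token)
```

With this approach the commented-out define yields `COMMENT_START` followed by `DEFINE`, so the same `defineDirective` alternatives can be reused for it, and `// @section test` produces a `SECTION` token of its own instead of disappearing into a line comment.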