I'm using ANTLR to parse some queries.
Here is my ANTLR g4 grammar:
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (EQ|NEQ) 'wwww' # propTestThlEqual
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
;
primitiveLiteral
: orderableLiteral
;
orderableLiteral
: StringLiteral
;
StringLiteral
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
When I enter:
[network-traffic:src_port = '123]
I expect the match to happen at
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
But it doesn't match anything. However, as soon as I remove
| objectPath NOT? (EQ|NEQ) 'wwww' # propTestThlEqual
the match happens at
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
Any idea what's going on?
Update:
grammar STIXPattern;
pattern
: observationExpressions EOF
;
observationExpressions
: <assoc=left> observationExpressions FOLLOWEDBY observationExpressions #observationExpressionsFollowedBY
| observationExpressionOr #observationExpressionOr_
;
observationExpressionOr
: <assoc=left> observationExpressionOr OR observationExpressionOr #observationExpressionOred
| observationExpressionAnd #observationExpressionAnd_
;
observationExpressionAnd
: <assoc=left> observationExpressionAnd AND observationExpressionAnd #observationExpressionAnded
| observationExpression #observationExpression_
;
observationExpression
: LBRACK comparisonExpression RBRACK # observationExpressionSimple
| LPAREN observationExpressions RPAREN # observationExpressionCompound
| observationExpression startStopQualifier # observationExpressionStartStop
| observationExpression withinQualifier # observationExpressionWithin
| observationExpression repeatedQualifier # observationExpressionRepeated
;
comparisonExpression
: <assoc=left> comparisonExpression OR comparisonExpression #comparisonExpressionOred
| comparisonExpressionAnd #comparisonExpressionAnd_
;
comparisonExpressionAnd
: <assoc=left> comparisonExpressionAnd AND comparisonExpressionAnd #comparisonExpressionAnded
| propTest #comparisonExpressionAndpropTest
;
propTest
: objectPath NOT? (EQ|NEQ) primitiveLiteral # propTestEqual
| objectPath NOT? (EQ|NEQ) objectPathThl # propTestThlEqual
;
startStopQualifier
: START TimestampLiteral STOP TimestampLiteral
;
withinQualifier
: WITHIN (IntPosLiteral|FloatPosLiteral) SECONDS
;
repeatedQualifier
: REPEATS IntPosLiteral TIMES
;
objectPath
: objectType COLON firstPathComponent objectPathComponent?
;
objectPathThl
: varThlType DOT firstPathComponent objectPathComponent?
;
objectType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
varThlType
: IdentifierWithoutHyphen
| IdentifierWithHyphen
;
firstPathComponent
: IdentifierWithoutHyphen
| StringLiteral
;
objectPathComponent
: <assoc=left> objectPathComponent objectPathComponent # pathStep
| '.' (IdentifierWithoutHyphen | StringLiteral) # keyPathStep
| LBRACK (IntPosLiteral|IntNegLiteral|ASTERISK) RBRACK # indexPathStep
;
setLiteral
: LPAREN RPAREN
| LPAREN primitiveLiteral (COMMA primitiveLiteral)* RPAREN
;
primitiveLiteral
: orderableLiteral
| BoolLiteral
;
orderableLiteral
: IntPosLiteral
| IntNegLiteral
| FloatPosLiteral
| FloatNegLiteral
| StringLiteral
| BinaryLiteral
| HexLiteral
| TimestampLiteral
;
IntNegLiteral :
'-' ('0' | [1-9] [0-9]*)
;
IntPosLiteral :
'+'? ('0' | [1-9] [0-9]*)
;
FloatNegLiteral :
'-' [0-9]* '.' [0-9]+
;
FloatPosLiteral :
'+'? [0-9]* '.' [0-9]+
;
HexLiteral :
'h' QUOTE TwoHexDigits* QUOTE
;
BinaryLiteral :
'b' QUOTE
( Base64Char Base64Char Base64Char Base64Char )*
( (Base64Char Base64Char Base64Char Base64Char )
| (Base64Char Base64Char Base64Char ) '='
| (Base64Char Base64Char ) '=='
)
QUOTE
;
StringLiteral :
QUOTE ( ~['\\] | '\\\'' | '\\\\' )* QUOTE
;
BoolLiteral :
TRUE | FALSE
;
TimestampLiteral :
't' QUOTE
[0-9] [0-9] [0-9] [0-9] HYPHEN
( ('0' [1-9]) | ('1' [012]) ) HYPHEN
( ('0' [1-9]) | ([12] [0-9]) | ('3' [01]) )
'T'
( ([01] [0-9]) | ('2' [0-3]) ) COLON
[0-5] [0-9] COLON
([0-5] [0-9] | '60')
(DOT [0-9]+)?
'Z'
QUOTE
;
//////////////////////////////////////////////
// Keywords
AND: 'AND' ;
OR: 'OR' ;
NOT: 'NOT' ;
FOLLOWEDBY: 'FOLLOWEDBY';
LIKE: 'LIKE' ;
MATCHES: 'MATCHES' ;
ISSUPERSET: 'ISSUPERSET' ;
ISSUBSET: 'ISSUBSET' ;
LAST: 'LAST' ;
IN: 'IN' ;
START: 'START' ;
STOP: 'STOP' ;
SECONDS: 'SECONDS' ;
TRUE: 'true' ;
FALSE: 'false' ;
WITHIN: 'WITHIN' ;
REPEATS: 'REPEATS' ;
TIMES: 'TIMES' ;
// After keywords, so the lexer doesn't tokenize them as identifiers.
// Object types may have unquoted hyphens, but property names
// (in object paths) cannot.
IdentifierWithoutHyphen :
[a-zA-Z_] [a-zA-Z0-9_]*
;
IdentifierWithHyphen :
[a-zA-Z_] [a-zA-Z0-9_-]*
;
EQ : '=' | '==';
NEQ : '!=' | '<>';
LT : '<';
LE : '<=';
GT : '>';
GE : '>=';
QUOTE : '\'';
COLON : ':' ;
DOT : '.' ;
COMMA : ',' ;
RPAREN : ')' ;
LPAREN : '(' ;
RBRACK : ']' ;
LBRACK : '[' ;
PLUS : '+' ;
HYPHEN : MINUS ;
MINUS : '-' ;
POWER_OP : '^' ;
DIVIDE : '/' ;
ASTERISK : '*';
fragment HexDigit: [A-Fa-f0-9];
fragment TwoHexDigits: HexDigit HexDigit;
fragment Base64Char: [A-Za-z0-9+/];
// Whitespace and comments
//
WS : [ \t\r\n\u000B\u000C\u0085\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
// Catch-all to prevent lexer from silently eating unusable characters.
InvalidCharacter
: .
;
It doesn't match because '123 has no closing '.
Here is your token stream for your example (I've also included the error message):
[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<'''>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''
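With the input [network-traffic:src_port = '123'] it matches fine.
(I added your | objectPath NOT? (EQ | NEQ) 'wwww' # propTestThlEqual1 alternative to propTest, and it still matches the string above.)
Here is the token stream with the closing ' added: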
[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:32=''123'',<StringLiteral>,1:28]
[@6,33:33=']',<']'>,1:33]
[@7,34:33='<EOF>',<EOF>,1:34]
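The lexer picks the token rule that produces the longest match, so '123' is consumed as a single StringLiteral, while the unterminated '123 can only be split into a lone ' token followed by an IntPosLiteral.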
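A few comments on your grammar...
You probably want to make QUOTE a fragment so that it cannot be recognized as a token on its own (only inside the Lexer rules that reference it). (Any rule whose name starts with an uppercase letter is a Lexer rule; by convention Lexer rules are all uppercase, but the first letter is what "counts".)
If I change the QUOTE rule to fragment QUOTE: '\''; then the token stream is (again with the error message included):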
[@0,0:0='[',<'['>,1:0]
[@1,1:15='network-traffic',<IdentifierWithHyphen>,1:1]
[@2,16:16=':',<':'>,1:16]
[@3,17:24='src_port',<IdentifierWithoutHyphen>,1:17]
[@4,26:26='=',<EQ>,1:26]
[@5,28:28=''',<InvalidCharacter>,1:28]
[@6,29:31='123',<IntPosLiteral>,1:29]
[@7,32:32=']',<']'>,1:32]
[@8,33:32='<EOF>',<EOF>,1:33]
line 1:28 no viable alternative at input 'network-traffic:src_port=''
You get the same "no viable alternative" error, but you also get an InvalidCharacter token (from the InvalidCharacter: .; rule), which helps hint at the problem.
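The two token streams above can be reproduced with a small longest-match sketch. This is not ANTLR itself, only a Python approximation that assumes a hand-picked subset of the lexer rules translated to regular expressions; the `tokenize` function and its `quote_is_fragment` flag are hypothetical names introduced for illustration, and token types are shown by rule name rather than by literal as in the ANTLR dumps.

```python
import re

# Simplified subset of the lexer rules, in grammar order.
# On a tie in match length, the rule listed first wins (as in ANTLR).
RULES = [
    ("IntPosLiteral", r"\+?(?:0|[1-9][0-9]*)"),
    ("StringLiteral", r"'(?:[^'\\]|\\'|\\\\)*'"),
    ("IdentifierWithoutHyphen", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("IdentifierWithHyphen", r"[a-zA-Z_][a-zA-Z0-9_-]*"),
    ("EQ", r"==?"),
    ("QUOTE", r"'"),
    ("COLON", r":"),
    ("RBRACK", r"\]"),
    ("LBRACK", r"\["),
    ("InvalidCharacter", r"."),  # catch-all, always matches one character
]


def tokenize(text, quote_is_fragment=False):
    """Return (rule name, text) pairs using longest-match selection.

    quote_is_fragment=True drops QUOTE as a standalone token, which
    approximates declaring it as a fragment in the grammar.
    """
    rules = [
        (name, re.compile(pattern))
        for name, pattern in RULES
        if not (quote_is_fragment and name == "QUOTE")
    ]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # stands in for WS -> skip
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, rx in rules:
            m = rx.match(text, pos)
            # Strictly longer only, so earlier rules win ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for tok in tokenize("[network-traffic:src_port = '123]"):
        print(tok)
```

Calling `tokenize("[network-traffic:src_port = '123]", quote_is_fragment=True)` turns the lone quote into an InvalidCharacter token, while the properly terminated `'123'` is matched as one StringLiteral in either mode.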
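As for why you get a different result when the propTest rule has only one alternative... that's interesting. With a single alternative, I get an extraneous input ''' expecting { warning for your example, and a mismatched input ']' expecting { warning for the second example in your comment.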
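Both of these are the result of ANTLR trying to do a better job of error recovery. (See "Recovering from Errors in Subrules" and "A Parade of Errors" in "The Definitive ANTLR 4 Reference" from the Pragmatic Programmers, pretty much a "must have" book if you plan to do much with ANTLR.) It now seems apparent that ANTLR can't really engage in these recovery attempts when a rule has multiple alternatives. (I did look at the ATN diagrams, but they don't really cover these error recovery paths, so the differences were "uninteresting".)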
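Since you only see these warnings in the single-alternative version of the propTest parser rule, handling them may actually be "irrelevant". Just deal with the no viable alternative error for the bad input and move on.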
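FYI... if you do want to take advantage of these error recovery strategies but still be notified of the warnings, you can implement your own ErrorListener class.
I almost always do this so that I have more control over capturing all errors and warnings and deciding how to manage them in the UI. The default ErrorHandler pretty much just sends messages to the console.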
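A minimal sketch of such a listener, assuming the Python target (the antlr4-python3-runtime package); the answer does not say which target it used. `CollectingErrorListener` is a hypothetical name, and the fallback base class exists only so the sketch can run where the runtime is not installed.

```python
try:
    from antlr4.error.ErrorListener import ErrorListener
except ImportError:
    # Fallback stub so the sketch runs without the ANTLR runtime installed.
    class ErrorListener(object):
        pass


class CollectingErrorListener(ErrorListener):
    """Collects syntax errors and warnings instead of printing them."""

    def __init__(self):
        super().__init__()
        self.errors = []  # list of (line, column, message) tuples

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        # Called for every reported problem, including the
        # "extraneous input" and "mismatched input" recovery messages.
        self.errors.append((line, column, msg))


# Hypothetical wiring, assuming a parser generated from STIXPattern.g4:
#   listener = CollectingErrorListener()
#   parser.removeErrorListeners()      # drop the default console listener
#   parser.addErrorListener(listener)
#   parser.pattern()
#   for line, column, msg in listener.errors:
#       ...decide how to present each problem in the UI...
```

Collecting plain tuples keeps the listener independent of any UI; the same listener can also be attached to the lexer if lexer errors should be captured too.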