I'm trying to write a parser for Juniper SRX router access control lists. Here is the grammar I'm using:
grammar SRXBackend;
acl:
'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}'
applications
addressBook
;
replaceStmt:
'replace:' IDENT
| 'replace:' 'from-zone' IDENT 'to-zone' IDENT
;
policy:
'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}'
;
fromStmt:
'source-address' addrBlock # sourceAddrStmt
| 'destination-address' addrBlock # destinationAddrStmt
| 'application' (srxName ';' | '[' srxName+ ']') # applicationBlock
;
action:
'permit' ';'
| 'deny' ';'
| 'log { session-close; }'
;
addrBlock:
'[' srxName+ ']'
| srxName ';'
;
applications:
'applications' '{' application* '}'
| 'applications' '{' 'apply-groups' IDENT ';' '}' 'groups' '{' replaceStmt '{' 'applications' '{' application* '}' '}' '}'
;
addressBook:
'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}'
| 'groups' '{' replaceStmt '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}' 'security' '{' 'apply-groups' IDENT ';' '}'
;
application:
'replace:'? 'application' srxName '{' applicationStmt+ '}'
;
applicationStmt:
'protocol' srxName ';' #applicationProtocol
| 'source-port' portRange ';' #applicationSrcPort
| 'destination-port' portRange ';' #applicationDstPort
;
portRange:
NUMBER #portRangeOne
| NUMBER '-' NUMBER #portRangeMinMax
;
addrEntry:
'address-set' IDENT '{' addrEntryStmt+ '}' #addrEntrySet
| 'address' srxName cidr ';' #addrEntrySingle
;
addrEntryStmt:
('address-set' | 'address') srxName ';'
;
cidr:
NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)?
;
srxName:
NUMBER
| IDENT
| cidr
;
COMMENT : '/*' .*? '*/' ;
NUMBER : [0-9]+ ;
IDENT : [a-zA-Z][a-zA-Z0-9,-_:./]* ;
WS : [ \t\n]+ -> skip ;
When I try it on an ACL with ~80,000 lines, generating the parse tree takes up to ~10 minutes. This is the code I use to create the parse tree:
from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys

def main(argv):
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    parser.acl()

if __name__ == '__main__':
    main(sys.argv)
I'm using Python 2.7 as the target language. I also ran cProfile to find out which code takes the most time. Here are the top entries sorted by time:
ncalls tottime percall cumtime percall filename:lineno(function)
608448 62.699 0.000 272.359 0.000 LexerATNSimulator.py:152(execATN)
5007036 41.253 0.000 71.458 0.000 LexerATNSimulator.py:570(consume)
5615722 32.048 0.000 70.416 0.000 DFAState.py:131(__eq__)
11230968 24.709 0.000 24.709 0.000 InputStream.py:73(LA)
5006814 21.881 0.000 31.058 0.000 LexerATNSimulator.py:486(captureSimState)
5007274 20.497 0.000 29.349 0.000 ATNConfigSet.py:160(__eq__)
10191162 18.313 0.000 18.313 0.000 {isinstance}
10019610 16.588 0.000 16.588 0.000 {ord}
5615484 13.331 0.000 13.331 0.000 LexerATNSimulator.py:221(getExistingTargetState)
6832160 12.651 0.000 12.651 0.000 InputStream.py:52(index)
5007036 10.593 0.000 10.593 0.000 InputStream.py:67(consume)
449433 9.442 0.000 319.463 0.001 Lexer.py:125(nextToken)
1 8.834 8.834 16.930 16.930 InputStream.py:47(_loadString)
608448 8.220 0.000 285.163 0.000 LexerATNSimulator.py:108(match)
1510237 6.841 0.000 10.895 0.000 CommonTokenStream.py:84(LT)
449432 6.044 0.000 363.766 0.001 Parser.py:344(consume)
449433 5.801 0.000 9.933 0.000 Token.py:105(__init__)
I can't really make much sense of it, other than InputStream.LA taking about half a minute. I assume that is because the whole text string is buffered/loaded at once. Is there any alternative/lazier way to parse or load the data with the Python target? And is there anything I can improve in the grammar to speed up parsing?

Thanks
My understanding is that, because of the * rather than a +, your IDENT can be zero-sized. That sends the parser into a loop at every character, producing zero-size IDENT nodes.
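One way to sanity-check that claim is to run only the lexer over the input and count zero-length tokens. A minimal sketch, assuming the generated SRXBackendLexer from the question ("acl.txt" is just a placeholder file name):

from antlr4 import FileStream, CommonTokenStream
from SRXBackendLexer import SRXBackendLexer

# Tokenize the whole file without ever invoking the parser.
tokens = CommonTokenStream(SRXBackendLexer(FileStream("acl.txt")))
tokens.fill()

# ANTLR token start/stop indices are inclusive, so a zero-length token has stop < start.
empty = [t for t in tokens.tokens if t.stop < t.start]
print("%d tokens total, %d zero-length" % (len(tokens.tokens), len(empty)))

Timing just this step also shows how much of the ~10 minutes is spent in the lexer versus the parser.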
Try this:
# Install the ANTLR Python runtime (the antlr4 tool itself is assumed to be installed separately)
!pip install antlr4-python3-runtime

# Generate a lexer and parser from the grammar, targeting Python 3
!antlr4 -Dlanguage=Python3 SRXBackend.g4

# Import the ANTLR runtime and the generated lexer and parser
from antlr4 import InputStream, CommonTokenStream
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser

# Read the input file
with open("input.txt", "r") as file:
    input_str = file.read()

# Create a lexer and parser
lexer = SRXBackendLexer(InputStream(input_str))
stream = CommonTokenStream(lexer)
parser = SRXBackendParser(stream)

# Parse the input
tree = parser.acl()
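Beyond the grammar, a speed-up that is often suggested for ANTLR's Python targets is two-stage parsing: attempt the cheaper SLL prediction mode first and fall back to full LL only if it fails. A minimal sketch, assuming the generated classes from the question and the Python 3 runtime (parse_acl is just a helper name; predictionMode and _errHandler are attributes of the runtime's ATN simulator and parser):

from antlr4 import FileStream, CommonTokenStream, PredictionMode
from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
from antlr4.error.Errors import ParseCancellationException
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser

def parse_acl(path):
    # Tokenize once; the same token stream is reused by both attempts.
    tokens = CommonTokenStream(SRXBackendLexer(FileStream(path)))
    parser = SRXBackendParser(tokens)

    # First attempt: SLL prediction plus a bail-out error strategy that
    # raises on the first mismatch instead of trying to recover.
    parser._interp.predictionMode = PredictionMode.SLL
    parser._errHandler = BailErrorStrategy()
    try:
        return parser.acl()
    except ParseCancellationException:
        # Second attempt: rewind and reparse with full LL prediction and
        # the default error handling.
        tokens.seek(0)
        parser.reset()
        parser._interp.predictionMode = PredictionMode.LL
        parser._errHandler = DefaultErrorStrategy()
        return parser.acl()

tree = parse_acl("input.txt")

For many grammars the SLL pass succeeds on its own, so the more expensive LL machinery only runs when it is actually needed.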