Python Lex-Yacc (PLY): not recognizing start of line or start of string



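I am new to PLY and a bit more than a beginner in Python. I am trying to play around with PLY-3.4 and Python 2.7 to learn it. Please see the code below. I am trying to create a token QTAG, which is a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', a positive integer, and one or more whitespace characters. For example, VALID QTAGs are: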

"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''

INVALID ones are:

"asdf Q.15 "
"Q.  15 "

Here is my code:

import ply.lex as lex
class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT'
        ]

     # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t
    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)
    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'
    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)
    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
             tok = self.lexer.token()
             if not tok: break
             print tok
# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test('''
   Q.14 
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q.  15 ")

The output I get is as follows:

LexToken(QTAG,11,1,0)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,12,1,4)
LexToken(QTAG,13,1,0)
Newline found
Illegal character 'Q'
Illegal character '.'
LexToken(INT,14,2,6)
Newline found
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,7)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,4)

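Notice that only the first and third VALID inputs are tokenized correctly. I cannot figure out why my other VALID inputs are not tokenized properly. Regarding the docstring of t_QTAG:

  1. Replacing '^' with '\A' did not work.
  2. I tried removing '^'. Then all the VALID inputs are tokenized, but the second INVALID input is tokenized as well.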

Any help is appreciated. Thanks in advance!

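PS: I joined the google-group ply-hack and tried posting there, but I could not post either directly in the forum or through email. I am not sure whether the group is still active. Prof. Beazley is not responding either. Any ideas?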

I finally found the answer myself. Posting it here so that others may find it useful.

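As @Tadgh rightly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the regex for t_QTAG above can no longer match, and as a consequence the second VALID input is not tokenized. By reading the PLY documentation carefully, I learned that if the ordering of the token regexes is to be maintained, they have to be defined in functions rather than in strings, as was done for t_ignore. If strings are used, PLY automatically sorts them by decreasing regex length and appends them after the functions. Here t_ignore is special, I guess, in that it is somehow applied before anything else. This part is not clearly documented. The workaround is to define a function for a new token, e.g. t_SPACETAB, after t_QTAG, and simply not return anything from it. With this, all the VALID inputs are tokenized correctly, except the one with triple quotes (the multi-line string containing "Q.14"). Also, the INVALID ones are, as per the spec, not tokenized.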
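To make the interaction between skipped whitespace and the '^' anchor concrete, here is a minimal sketch that uses only the re module. It assumes that the lexer first skips the ignored characters and then tries the rule with a match starting at the current position; that is my reading of PLY's behavior, not something the documentation states outright. The names QTAG_ANCHORED and skip_ignored are made up for this illustration and are not part of PLY.

```python
import re

# The QTAG rule from the question, compiled on its own (illustrative name).
QTAG_ANCHORED = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')

def skip_ignored(data, pos, ignore=' \t'):
    """Hypothetical helper: advance pos past ignored characters,
    assumed to mimic what t_ignore does before rules are tried."""
    while pos < len(data) and data[pos] in ignore:
        pos += 1
    return pos

data = "  Q.12 "
pos = skip_ignored(data, 0)  # the two leading spaces are skipped

# Without the MULTILINE flag, '^' matches only at the real start of the
# string, so a match attempted from pos fails even though "Q.12 " is there.
after_skip = QTAG_ANCHORED.match(data, pos)

# If nothing is skipped, the same rule matches from position 0.
from_start = QTAG_ANCHORED.match(data, 0)

print(pos, after_skip, from_start.group(0) if from_start else None)
```

The same reasoning would apply to '\A', which also matches only at the real start of the string regardless of where the match attempt begins.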

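Multi-line string issue: it turns out that PLY uses the re module internally. In that module, ^ is by default interpreted only at the beginning of the string and NOT at the beginning of every line. To change that behavior, I need to turn on the multi-line flag, which can be done inside the regex using (?m). So, to properly process all the VALID and INVALID strings in my test, the correct regex is: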

r'(?m)^\s*[Qq]\.[0-9]+\s+'
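As a quick check of the multi-line flag outside of PLY, the sketch below compares the pattern with and without (?m) when the match attempt starts right after a newline. It uses only the re module; the variable names are illustrative, and the starting position 1 is chosen by hand rather than produced by a lexer.

```python
import re

text = "\n   Q.14 \n"

# Same pattern, without and with the inline multi-line flag.
plain = re.compile(r'^\s*[Qq]\.[0-9]+\s+')
multi = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')

# Position 1 is the start of the second line, just after the newline.
plain_hit = plain.match(text, 1)   # '^' does not match here by default
multi_hit = multi.match(text, 1)   # with (?m), '^' matches after a newline

print(plain_hit)
print(repr(multi_hit.group(0)))

# The same value conversion used in t_QTAG: strip, drop "Q.", parse int.
value = int(multi_hit.group(0).strip()[2:])
print(value)
```

Only the (?m) variant matches at the start of an inner line, which is the behavior the rule needs when a QTAG appears after earlier lines of input.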

Here is the corrected code, with some more tests added:

import ply.lex as lex
class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
        ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t
    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)    
        return t
    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)
    # Spaces and tabs are matched by a rule function (which returns nothing)
    # instead of t_ignore  = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"
    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)
    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
             tok = self.lexer.token()
             if not tok: break
             print tok
# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test("""

   Q.14
""")
q.test("""
qewr
dhdhg
dfhg
   Q.15 asda
""")
# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q.  17 ")

The output is as follows:

-============Testing some VALID inputs===========-
LexToken(QTAG,11,1,0)
LexToken(QTAG,12,1,0)
LexToken(QTAG,13,1,0)
LexToken(QTAG,14,1,0)
Newline found
Illegal character 'q'
Illegal character 'e'
Illegal character 'w'
Illegal character 'r'
Newline found
Illegal character 'd'
Illegal character 'h'
Illegal character 'd'
Illegal character 'h'
Illegal character 'g'
Newline found
Illegal character 'd'
Illegal character 'f'
Illegal character 'h'
Illegal character 'g'
Newline found
LexToken(QTAG,15,6,18)
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'a'
Newline found
-============Testing some INVALID inputs===========-
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,16,8,7)
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
Space(s) and/or tab(s)
LexToken(INT,17,8,4)
Space(s) and/or tab(s)
