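I'm implementing an interpreter for the scripting language of a long-obsolete text editor, and I'm having trouble getting the lexer to work properly.
Here is an example of the problematic part of the language: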
T
L /LOCATE ME/
C /LOCATE ME/CHANGED ME/ * *
C ;CHANGED ME;CHANGED ME AGAIN; 1 *
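In a sed-type syntax, the / character seems to quote strings and also acts as the delimiter for the C (CHANGE) command, although the command accepts any character as its delimiter.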
So far I have implemented maybe half of the most common commands using nothing but parse_tokens(line.split()). It is quick and dirty, but it works surprisingly well.
To avoid writing my own lexer, I tried shlex.
It works quite well, except for the CHANGE cases:
import shlex

def shlex_test(cmd_str):
    # Treat / as the only quote character
    lex = shlex.shlex(cmd_str)
    lex.quotes = '/'
    return list(lex)

print(shlex_test('L /spaced string/'))
# OK! gives: ['L', '/spaced string/']
print(shlex_test('C /spaced string/another string/ * *'))
# gives : ['C', '/spaced string/', 'another', 'string/', '*', '*']
# desired : any format that doesn't split on a space between /'s
print(shlex_test('C ;a b;b a;'))
# gives : ['C', ';', 'a', 'b', ';', 'b', 'a', ';']
# desired : same format as CHANGE command above
Does anyone know a simple way to do this (with shlex or otherwise)?
Edit:
In case it helps, here is the CHANGE command syntax given in the help file:
'''
C [/stg1/stg2/ [n|n m]]
The CHANGE command replaces the m-th occurrence of "stg1" with "stg2"
for the next n lines. The default value for m and n is 1.'''
The X and Y commands are similarly hard to tokenize:
'''
X [/command/[command/[...]]n]
Y [/command/[command/[...]]n]
The X and Y commands allow the execution of several commands contained
in one command. To define an X or Y "command string", enter X (or Y)
followed by a space, then individual commands, each separated by a
delimiter (e.g. a period "."). An unlimited number of commands may be
placed in the X or Y command string. Once the command string has been
defined, entering X (or Y) followed optionally by a count n will execute
the defined command string n times. If n is not specified, it will
default to 1.'''
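The problem is probably that / does not denote a quote but only a delimiter. I'm guessing the third character always defines the delimiter. Also, you don't need the / or ; in the output, do you?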
I simply split on the delimiter for the L and C command use cases:
>>> def parse(cmd):
...     delim = cmd[2]
...     return cmd.split(delim)
...
>>> c_cmd = "C /LOCATE ME/CHANGED ME/ * *"
>>> parse(c_cmd)
['C ', 'LOCATE ME', 'CHANGED ME', ' * *']
>>> c_cmd2 = "C ;a b;b a;"
>>> parse(c_cmd2)
['C ', 'a b', 'b a', '']
>>> l_cmd = "L /spaced string/"
>>> parse(l_cmd)
['L ', 'spaced string', '']
For the optional " * *" part, you can use split(" ") on the last list element.
>>> parse(c_cmd)[-1].split(" ")
['', '*', '*']
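The two steps above can be folded into one function. This is only a rough sketch: the name parse_command and the (name, strings, args) return shape are my own invention, and it assumes that the first non-blank character after the command name is the delimiter and that the closing delimiter is always present. It also does not handle a bare count such as "X 3", where the digit would be mistaken for a delimiter.

```python
def parse_command(line):
    """Split an editor command into (name, strings, args).

    Assumptions (not taken from the help file): the first non-blank
    character after the command name is the delimiter, and whatever
    follows the last delimiter is a space-separated argument list.
    """
    name, _, rest = line.strip().partition(" ")
    rest = rest.lstrip()
    if not rest:
        # Bare command such as "T": no strings, no arguments
        return name, [], []
    delim = rest[0]
    # Everything before the last delimiter is a delimited string;
    # the remainder holds the optional arguments.
    *strings, tail = rest[1:].split(delim)
    # split() with no argument drops the empty strings that split(" ") leaves
    return name, strings, tail.split()


print(parse_command("C /LOCATE ME/CHANGED ME/ * *"))
print(parse_command("C ;a b;b a;"))
print(parse_command("L /spaced string/"))
```

Using split() without an argument on the tail avoids the leading '' that split(" ") produces, and looking for the delimiter after the first space rather than at cmd[2] tolerates extra spaces after the command letter.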