Regex version of Python strings

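I would like to know what regex can be used to parse Python strings. After a few failed attempts, I arrived at a regex that parses one of the most commonly used string forms, such as

"this is \"my string\", which ends here"

Here is my regex "code":

"([^"\\]|(\\.))*"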

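I'm asking because I haven't found anything like this online before. Can I take this expression and "develop" it further so that it parses every kind of Python string? If you find the question interesting, I recommend trying an online regex tester, where you can quickly check your expressions.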
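For anyone who wants to try the pattern from the question locally, here is a minimal sketch using Python's `re` module. The variable names (`PATTERN`, `text`, `found`) are only illustrative, and the sample line is an assumption built around the string shown in the question.

```python
import re

# The pattern from the question: a double-quoted string whose body consists of
# ordinary characters or backslash escapes.
PATTERN = re.compile(r'"([^"\\]|(\\.))*"')

# Sample input (an assumption): the string from the question embedded in a line.
text = r'foo = 1 "this is \"my string\", which ends here" bar'

match = PATTERN.search(text)
found = match.group(0) if match else None
print(found)
```

The raw-string prefix keeps the backslashes in both the pattern and the sample text exactly as typed.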

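Your regex pattern (and the one in @thebjorn's link) will fail if there is an odd number (greater than 1) of backslashes before a quote. I suggest using this pattern instead (in single-line mode):

"(?:[^"\\]|\\{2}|\\.)*"

An optimized version:

"(?:(?=([^"\\]+|\\{2}|\\.))\1)*"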

To handle single quotes as well:

(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1

or

(["'])(?:(?=([^"'\\]+|\\{2}|\\.|(?!\1)["']))\2)*\1

(Note that the last characters of the four patterns line up exactly. A sign?)
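As a rough check of the two quote-aware patterns, the sketch below runs both over a few hand-written samples and compares the results. The sample strings and the names `PLAIN` and `LOOKAHEAD` are assumptions made for illustration; "single-line mode" is taken to mean `re.DOTALL` in Python.

```python
import re

# Plain alternation version.
PLAIN = re.compile(r'''(["'])(?:[^"'\\]|\\{2}|\\.|(?!\1)["'])*\1''', re.DOTALL)

# "Optimized" version: the lookahead captures a chunk into group 2 and the
# backreference \2 then consumes it. Because a lookahead is atomic, the engine
# cannot backtrack into the captured chunk.
LOOKAHEAD = re.compile(
    r'''(["'])(?:(?=([^"'\\]+|\\{2}|\\.|(?!\1)["']))\2)*\1''', re.DOTALL)

# Hypothetical test strings: mixed quotes, an escaped quote, and three
# backslashes in front of a quote.
samples = [
    r"""'it\'s "fine"'""",
    r'''"say 'hi'"''',
    r'"odd \\\" run"',
]
text = ' + '.join(samples)

plain = [m.group(0) for m in PLAIN.finditer(text)]
lookahead = [m.group(0) for m in LOOKAHEAD.finditer(text)]
print(plain == lookahead == samples)
```

Group 1 remembers which quote opened the string, so the other quote character is allowed inside the body through `(?!\1)["']`.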

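Here is a different approach that uses tokenize.generate_tokens to identify Python strings. The tokenize module uses regexes itself, so by using tokenize you leave the complicated dirty work to Python. By relying on a higher-level function you can be more confident that the regexes are correct (and you avoid reinventing the wheel). It also correctly recognizes every kind of Python string (for example the single-quoted, double-quoted and triple-quoted variants) without being confused by comments.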

import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        # Map the numeric token type to its symbolic name, e.g. 'STRING'
        return token.tok_name[self.num]

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''
# In Python 3, generate_tokens needs a readline callable that returns str
for tok in tokenize.generate_tokens(io.StringIO(text).readline):
    tok = Token(*tok)            # 1
    if tok.name == 'STRING':     # 2
        print(tok.val)

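1. tokenize.generate_tokens returns tuples. The Token class lets you access the information in each tuple in a nicer way.
2. In particular, each Token has a name, such as 'STRING', 'NEWLINE', 'INDENT' or 'OP'. You can use it to identify Python strings.

Edit: I like using the Token class so that I don't have to write token.tok_name[num] all over the place. For the code above, though, it may be clearer and simpler to forget the Token class and just spell out the main idea explicitly: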
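To make the token names concrete, here is a small sketch that lists the name of every token in a two-line snippet. The snippet in `code` and the variable `names` are assumptions chosen for illustration.

```python
import io
import token
import tokenize

# A tiny hypothetical program with an indented block and one string literal.
code = 'if x:\n    s = "hi"\n'

# Translate each token's numeric type into its symbolic name.
names = [token.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(code).readline)]
print(names)
```

The string literal shows up as a single 'STRING' entry, next to names such as 'NAME', 'OP', 'NEWLINE' and 'INDENT'.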

import tokenize
import token
import io

text = r'''foo = 1 "this is \"my string\", which ends here" bar'''
# In Python 3, generate_tokens needs a readline callable that returns str
for num, val, start, end, line in tokenize.generate_tokens(io.StringIO(text).readline):
    if token.tok_name[num] == 'STRING':
        print(val)
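The claim that tokenize copes with the different quoting styles and is not confused by comments can be checked with a sketch like the following. The helper name `python_strings` and the contents of `source` are hypothetical.

```python
import io
import tokenize

# Hypothetical source: a comment containing quotes plus four kinds of literal.
source = r'''a = 'single'  # a comment with "quotes" is not a string
b = "double \" escaped"
c = """triple
quoted"""
d = r'raw\d+'
'''

def python_strings(code):
    """Return the source text of every string literal found in `code`."""
    readline = io.StringIO(code).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.STRING]

strings = python_strings(source)
for s in strings:
    print(repr(s))
```

Each literal comes back with its prefix and quotes intact, and the quoted text inside the comment is skipped because it is reported as part of a COMMENT token. One caveat: on Python 3.12 and later, f-strings are split into several tokens rather than one STRING token, so this sketch does not cover them.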

This seems to handle everything correctly:

rr = r'''(?xi)
    (r|u|ru|ur|)
    (
        \''' (\\. | [\s\S])*? \'''
        |
        """ (\\. | [\s\S])*? """
        |
        ' (\\. | [^'\n])* '
        |
        " (\\. | [^"\n])* "
    )
'''

Test: https://ideone.com/DEimLl

Syntax reference: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
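In case the test link is unavailable, here is a minimal local check of the verbose pattern. The name `STRING_RE` and the sample `text` are assumptions; the pattern is the one from this answer.

```python
import re

# (?x) ignores whitespace in the pattern, (?i) lets the prefix match R, U, etc.
STRING_RE = re.compile(r'''(?xi)
    (r|u|ru|ur|)
    (
        \''' (\\. | [\s\S])*? \'''
        |
        """ (\\. | [\s\S])*? """
        |
        ' (\\. | [^'\n])* '
        |
        " (\\. | [^"\n])* "
    )
''')

# Hypothetical input: a prefixed string with an escaped quote, a single-quoted
# string, and a triple-quoted string spanning two lines.
text = 'x = R"a\\"b" + \'c\' + """d\ne"""'

found = [m.group(0) for m in STRING_RE.finditer(text)]
for s in found:
    print(repr(s))
```

The triple-quoted alternatives come first so that `"""` is not read as an empty string followed by an opening quote.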
