How do I use a regular expression to parse multi-line text between two braces?



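I am new to Python and am trying to learn regular expressions by example. Here I am trying to extract the dictionary-like parts from a multi-line text. In the example below, how do I extract the part between each pair of curly braces?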

MWE: how can I get a pandas DataFrame from this data?

import re
import pandas as pd
s = """
[
{
specialty: "Anatomic/Clinical Pathology",
one: " 12,643 ",
two: " 8,711 ",
three: " 385 ",
four: " 520 ",
five: " 3,027 ",
},
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
three: " 140 ",
four: " 316 ",
five: " 987 ",
},
{
specialty: "Vascular Surgery",
one: " 3,943 ",
two: " 3,586 ",
three: " 48 ",
four: " 13 ",
five: " 296 ",
},
]
"""
m = re.match('({.*})', s, flags=re.S)
data = m.groups()
df = pd.DataFrame(data)

I suggest adding double quotes around the keys, then converting the string to a list of dictionaries, and then simply reading that structure into a pandas DataFrame:
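import re
from ast import literal_eval

import pandas as pd

s = "YOUR STRING HERE"  # placeholder for the multi-line text from the question
# Wrap each bare key at the start of a line in double quotes
fixed_s = re.sub(r'^(\s*)(\w+):', r'\1"\2":', s, flags=re.M)
# Parse the now-valid literal into a list of dicts and build the DataFrame
df = pd.DataFrame(literal_eval(fixed_s.strip()))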

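The ^(\s*)(\w+): regex matches zero or more whitespace characters at the start of any line (flags=re.M makes ^ match at the start of every line, not only at the start of the string) and captures them in Group 1. It then matches one or more word characters, captured in Group 2, followed by a colon. The match is replaced with Group 1 + " + Group 2 + ":, which puts the key in double quotes and leaves the indentation untouched.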
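To see the substitution in isolation, here is a minimal sketch run on a two-key fragment shaped like the question's data. The names fragment and quoted are made up for illustration.

```python
import re

# A small fragment in the same shape as the question's data (keys are unquoted).
fragment = '{\nspecialty: "Nephrology",\none: " 11,407 ",\n}'

# ^(\s*) captures whitespace at the start of each line (because of re.M),
# (\w+) captures the bare key, and the replacement wraps that key in quotes.
quoted = re.sub(r'^(\s*)(\w+):', r'\1"\2":', fragment, flags=re.M)

print(quoted)
```

Only the keys at the start of a line change; the values, including the commas inside " 11,407 ", are left as they were.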

Use ast.literal_eval to convert the result into a list of dictionaries.
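As a small illustration of this step, assuming the keys have already been quoted, the sketch below parses a one-record string. The names text and records are hypothetical. The trailing commas after the last key and the last record are legal in Python literals, so they need no extra cleanup.

```python
from ast import literal_eval

# Keys are already quoted; trailing commas are allowed in Python literals.
text = '[\n{\n"specialty": "Nephrology",\n"one": " 11,407 ",\n},\n]'

# literal_eval only evaluates literals (lists, dicts, strings, numbers, ...),
# so it is a safer choice than eval for text of this kind.
records = literal_eval(text)

print(type(records).__name__, len(records))
print(records[0]["specialty"])
```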

Then initialize the DataFrame with that list.
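Putting the three steps together, the following sketch runs on a shortened copy of the question's data (two records, two numeric columns). The final cleanup loop that turns strings like " 12,643 " into integers is an addition for illustration, not part of the approach described above, and num_cols is a made-up name.

```python
import re
from ast import literal_eval

import pandas as pd

# Shortened copy of the question's data: two records instead of three.
s = """
[
{
specialty: "Anatomic/Clinical Pathology",
one: " 12,643 ",
two: " 8,711 ",
},
{
specialty: "Nephrology",
one: " 11,407 ",
two: " 9,964 ",
},
]
"""

# Step 1: quote the bare keys so the text becomes a valid Python literal.
fixed_s = re.sub(r'^(\s*)(\w+):', r'\1"\2":', s, flags=re.M)

# Step 2: parse into a list of dicts (strip() drops the surrounding blank lines).
records = literal_eval(fixed_s.strip())

# Step 3: build the DataFrame from the list of dicts.
df = pd.DataFrame(records)

# Optional cleanup (an assumption about what is wanted downstream):
# remove thousands separators and padding, then convert to integers.
num_cols = [c for c in df.columns if c != "specialty"]
for col in num_cols:
    df[col] = df[col].str.replace(",", "", regex=False).str.strip().astype(int)

print(df)
```

Each dictionary becomes one row, and the keys become the column names.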
