Matching text between nested brackets using RegEx in Python



I have a large CSV file, and one of its rows looks like this:

id_85,
{
    "link": "some link",
    "icon": "hello.gif",
    "name": "Wall Photos",
    "comments": {
        "count": 0
    },
    "updated_time": "2012-03-12",
    "object_id": "400",
    "is_published": true,
    "properties": [
        {
            "text": "University",
            "name": "By",
            "href": "some link"
        }
    ],
    "from": {
        "id": "7778",
        "name": "Let"
    },
    "message": "Hello World! :D",
    "id": "id_85",
    "created_time": "2012-03-12",
    "to": {
        "data": [
            {
                "id": "100",
                "name": "March"
            }
        ]
    },
    "message_tags": {
        "0": [
            {
                "id": "100",
                "type": "user",
                "name": "Marcelo",
                "length": 7,
                "offset": 0
            }
        ]
    },
    "type": "photo",
    "caption": "Hello world!"
}

I'm trying to get just the JSON part of it, from the first opening curly brace to the last closing one.

Here is my Python regex code so far:

import re
# Single quotes outside, so the double quotes inside the JSON need no escaping.
s = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.match(r'.*,({.*}$)', s)
if m:
    print(m.group(1))

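In some cases it does not capture from the first brace to the last one, like this: {…}. How can I make sure that only the text between the first and last braces is included, and nothing else?

The desired output looks like this: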

{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"}

Thanks!

This will match the entire JSON part after the first comma. I'm not sure whether that is what you want; an example of the desired output would help.

re.match(r'[^,]*,(.*)', s).group(1)
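As a quick check of the pattern above, here is a minimal sketch on a shortened, made-up line in the same "id,{json}" shape as the question. The names line, json_part and record are illustrative only, and it assumes the id itself contains no comma.

```python
import json
import re

# Shortened, hypothetical line in the same "id,{json}" shape as the question.
line = 'id_85,{"name": "Wall Photos", "comments": {"count": 0}}'

# [^,]* consumes the id up to the first comma; (.*) captures the rest of the line.
m = re.match(r'[^,]*,(.*)', line)
json_part = m.group(1) if m else None
print(json_part)

# Parsing the capture confirms that it is a complete JSON object.
record = json.loads(json_part)
print(record["comments"]["count"])
```

Running json.loads on the capture is a cheap way to catch a row where the split went wrong, since a truncated or over-long capture will normally raise an error.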

I believe this works, because .* is "greedy" in this case:

import re
s = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
# Greedy .* runs from the first { to the last } in the string.
m = re.search(r'({.*})', s)
if m:
    print(m.group(0))

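This may grab too much if the CSV row contains other JSON strings: it will be too greedy, because the final } in the pattern matches the last } that occurs in the string.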
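To illustrate that caveat, here is a small sketch with a made-up line holding two JSON objects. The greedy pattern spans both of them, whereas json.JSONDecoder.raw_decode from the standard library stops where the first object ends. Note that raw_decode is an alternative swapped in here for comparison, not something the answer above proposes.

```python
import json
import re

# Hypothetical line containing two JSON objects.
line = 'id_1,{"a": {"b": 1}},id_2,{"c": 2}'

# Greedy .* runs from the first { to the LAST } on the line.
greedy = re.search(r'\{.*\}', line).group(0)
print(greedy)

# raw_decode parses one JSON value starting at the given index and
# returns the parsed object plus the index where that value ended.
start = line.index('{')
obj, end = json.JSONDecoder().raw_decode(line, start)
first = line[start:end]
print(first)
```

Because the decoder actually parses the nesting, it is not confused by inner braces or by braces inside string values, which a regular expression in Python's re module cannot track.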

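Note that the notation re.search(r'someregex', string), i.e. the r prefix in front of the pattern, is called "raw string notation". It is typically used when you want backslashes to reach the regex engine literally rather than being processed as Python string escapes. For example, r'\n' is a two-character string containing a backslash and n, whereas '\n' is a single newline character.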
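A short sketch of the difference; the sample text and the names raw_hits and plain_hits are made up for illustration.

```python
import re

# A raw string keeps the backslash: two characters, backslash and n.
print(len(r'\n'))
# A normal string literal turns \n into a single newline character.
print(len('\n'))

text = 'cat concat cat'

# In a raw string, \b reaches the regex engine and means "word boundary".
raw_hits = re.findall(r'\bcat\b', text)
# In a normal string, \b becomes a backspace character before the regex
# engine ever sees it, so the pattern looks for backspaces around "cat".
plain_hits = re.findall('\bcat\b', text)

print(raw_hits)
print(plain_hits)
```

The patterns in this thread contain no backslashes, so the r prefix changes nothing for them; it is simply a safe habit for regex literals.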

Assuming (as originally posted) that each row of the CSV contains one JSON element, then

re.match(r'^[^{]*({.*})[^}]*$', s).group(1)

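will do. That is: discard everything that is not a { until you reach the first one, then capture into a group everything that follows, up to a } that has no other } after it.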
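A minimal sketch of that pattern on a shortened, made-up line with a trailing space after the last brace, as in the question's sample string. The braces are escaped here only for readability, and the names line, pattern and json_part are illustrative.

```python
import json
import re

# Shortened, hypothetical line; note the trailing space after the last brace.
line = 'id_85,{"from": {"id": "777"}, "caption": "Hello world!"} '

# [^{]* skips up to the first {, (\{.*\}) captures greedily up to the last },
# and [^}]* allows trailing characters that are not closing braces.
pattern = re.compile(r'^[^{]*(\{.*\})[^}]*$')
m = pattern.match(line)
json_part = m.group(1) if m else None
print(json_part)
print(json.loads(json_part)["from"]["id"])

# For contrast, a pattern that requires } to be the very last character
# finds nothing on this line, because of the trailing space.
strict = re.match(r'.*,(\{.*\}$)', line)
print(strict)
```

This still assumes one JSON element per row; with several objects on a line the capture would span from the first { to the last }.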
