I have a large CSV file in which one row looks like this:
id_85,
{
"link": "some link",
"icon": "hello.gif",
"name": "Wall Photos",
"comments": {
"count": 0
},
"updated_time": "2012-03-12",
"object_id": "400",
"is_published": true,
"properties": [
{
"text": "University",
"name": "By",
"href": "some link"
}
],
"from": {
"id": "7778",
"name": "Let"
},
"message": "Hello World! :D",
"id": "id_85",
"created_time": "2012-03-12",
"to": {
"data": [
{
"id": "100",
"name": "March"
}
]
},
"message_tags": {
"0": [
{
"id": "100",
"type": "user",
"name": "Marcelo",
"length": 7,
"offset": 0
}
]
},
"type": "photo",
"caption": "Hello world!"
}
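I am trying to extract only the JSON part, that is, everything from the first opening curly brace to the last closing one.
Here is my Python regex code so far: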
import re
# Single quotes outside, because the JSON itself contains double quotes.
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.match(r'.*,({.*}$)', str)
if m:
    print(m.group(1))
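In some cases it does not capture the text from the first to the last curly brace, like this: {…}. How can I make sure that only the text between the first and the last curly brace is included, and nothing else?
The expected output looks like this: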
{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"}
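Thanks!
This will match the whole JSON part after the first comma. I am not sure whether that is what you want; an example of the expected output would help.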
re.match(r'[^,]*,(.*)', s).group(1)
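The same "everything after the first comma" idea can also be sketched without a regex. This is only an illustration under two assumptions: the id never contains a comma, and the rest of the row is valid JSON. The row below is a shortened, hypothetical stand-in for the one in the question.

```python
import json

# Shortened, hypothetical sample row; the real rows are much longer.
row = 'id_85,{"id": "id_85", "comments": {"count": 0}, "type": "photo"} '

# Split once, at the first comma: everything after it is the JSON part.
row_id, _, json_part = row.partition(',')

# json.loads ignores surrounding whitespace and fails loudly on bad JSON.
data = json.loads(json_part)
print(row_id, data["comments"]["count"])
```

Parsing the captured text with `json.loads` doubles as a check that the extraction really returned a complete JSON object.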
I believe this works because .* is "greedy" in this case:
import re
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.search(r'({.*})', str)
if m:
    print(m.group(0))
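This may capture too much if a row of your CSV contains other JSON strings: the pattern is greedy, so the final } will match the last occurrence of } in str.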
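To make the over-matching concrete, here is a small sketch on a hypothetical row that holds two JSON objects. It also shows one possible alternative, `json.JSONDecoder().raw_decode`, which parses a single JSON value starting at a given index and reports where that value ends. The row and variable names are made up for illustration.

```python
import json
import re

# Hypothetical row containing two JSON objects.
row = 'id_1,{"a": {"b": 1}},{"c": 2}'

# Greedy: the match runs from the first "{" to the last "}" in the row.
greedy = re.search(r'\{.*\}', row).group(0)
print(greedy)  # {"a": {"b": 1}},{"c": 2}

# raw_decode parses exactly one JSON value beginning at `start`
# and returns the index just past it, so nesting is handled correctly.
start = row.index('{')
obj, end = json.JSONDecoder().raw_decode(row, start)
print(obj)             # {'a': {'b': 1}}
print(row[start:end])  # {"a": {"b": 1}}
```

A non-greedy `.*?` would not help here, because it would stop at the first `}` and cut a nested object short.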
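Note that the notation re.search(r'someregex', string), i.e. the r placed before the pattern, is known as "raw string notation". It is typically used when you want backslashes to be kept literally instead of being processed as Python string escapes, so that they reach the regex engine unchanged. For example, r'\n' is a two-character string containing \ and n, whereas '\n' is a one-character string containing a newline.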
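A quick way to see the difference, and a case where it changes what a pattern matches (the sample text is made up):

```python
import re

# A raw string keeps the backslash: two characters, "\" and "n".
raw_len = len(r'\n')
# A normal string turns the escape into a single newline character.
plain_len = len('\n')
print(raw_len, plain_len)  # 2 1

# "\b" is a backspace character in a normal string, but a word
# boundary when the regex engine receives the raw text "\b".
with_raw = re.findall(r'\bcat\b', 'cat concat')
without_raw = re.findall('\bcat\b', 'cat concat')
print(with_raw)     # ['cat']
print(without_raw)  # []
```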
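Assuming (as originally posted) that each row of the CSV contains exactly one JSON element, then
re.match(r'^[^{]*({.*})[^}]*$',str).group(1)
will do. That is: discard everything that is not a { until you find the first one, then capture into a group everything that follows, up to the } that has no other } after it.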
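A minimal sketch of that pattern applied to a row with text on both sides of the JSON. The row is a shortened, hypothetical example, and the `re.DOTALL` flag is an addition not in the answer above, included on the assumption that the JSON may span several lines, as the pretty-printed sample in the question does.

```python
import json
import re

# Hypothetical row: an id before the JSON, whitespace and a newline after it.
row = 'id_85,{"id": "id_85", "to": {"data": [{"id": "100"}]}} \n'

# re.DOTALL lets "." cross line breaks in case the JSON is multi-line.
m = re.match(r'^[^{]*(\{.*\})[^}]*$', row, re.DOTALL)
if m:
    json_part = m.group(1)
    # Parsing confirms that the captured text is one complete JSON object.
    first_id = json.loads(json_part)["to"]["data"][0]["id"]
    print(first_id)  # 100
```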