How to parse a file that looks like JSON but isn't



I'm trying to parse a file (filename.inc) in Python. It looks like this:

a: 2: {
s: 3: "somestuff";
a: 14: {
i: 601600;
a: 6: {
i: 559;
a: 4: {
s: 5: "label";
s: 3: "somelabel";
s: 2: "id";
s: 3: "559";
s: 10: "timestart";
s: 16: "01 01 1970 00:00";
s: 8: "timestop";
s: 16: "24 01 2020 20:55";
}
i: 18158;
a: 4: {
s: 5: "label";
s: 12: "someotherlabel";
s: 2: "id";
s: 5: "18158";
s: 10: "timestart";
s: 16: "01 01 1970 00:00";
s: 8: "timestop";
s: 16: "25 01 2020 18:55";
}
i: 10402;
a: 4: {
s: 5: "label";
s: 3: "newlabel";
s: 2: "id";
s: 5: "10402";
s: 10: "timestart";
s: 16: "01 01 1970 00:00";
s: 8: "timestop";
s: 16: "26 01 2020 06:55";
}

And so on... I tried using:

import json
with open('filename.inc') as json_file:
    data = json.load(json_file)

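but got: ValueError: No JSON object could be decoded

I then tried removing the first colon, adding quotes, and replacing the semicolons with commas: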

"a2": {
"s3": "somestuff",
"a14": {
"i": 601600,
"a6": {
"i": 559,
"a4": {
"s5": "label",
"s3": "somelabel",
"s2": "id",
"s3": "559",
"s10": "timestart",
"s16": "01 01 1970 00:00",
"s8": "timestop",
"s16": "24 01 2020 20:55",
}
"i": 18158,
"a4": {
"s5": "label",
"s12": "someotherlabel",
"s2": "id",
"s5": "18158",
"s10": "timestart",
"s16": "01 01 1970 00:00",
"s8": "timestop",
"s16": "25 01 2020 18:55",
}
"i": 10402,
"a4": {
"s5": "label",
"s3": "newlabel",
"s2": "id",
"s5": "10402",
"s10": "timestart",
"s16": "01 01 1970 00:00",
"s8": "timestop",
"s16": "26 01 2020 06:55",
}

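But that gives me multiple keys with the same ID... I considered converting it to an HTML file with tags so I could parse it with BeautifulSoup, but that seems too complicated for a file like this. I'd appreciate any hints. Thanks in advance.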
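To illustrate the duplicate-key problem in isolation, here is a minimal sketch using only the standard json module. The two-key string is a made-up fragment modeled on the rewritten data above, not the real file.

```python
import json

# Hypothetical fragment: the same key appears twice, as in the rewritten data.
text = '{"s3": "somelabel", "s3": "559"}'

# The default decoder silently keeps only the last value for a repeated key.
print(json.loads(text))  # {'s3': '559'}

# object_pairs_hook receives every (key, value) pair in order, so nothing is lost.
pairs = json.loads(text, object_pairs_hook=list)
print(pairs)  # [('s3', 'somelabel'), ('s3', '559')]
```

So even if the rewritten text were made syntactically valid, loading it into a plain dict would drop values, because the prefix plus length does not identify an entry uniquely.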

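This is PHP's serialize format rather than JSON. I checked, and the whitespace breaks both PHP's native unserialize and Python's phpserialize. The "cleanup" you performed makes it an invalid dump anyway (for example, s: 3: "somestuff" is illegal: it declares a 3-character string, but "somestuff" is clearly not 3 characters long), so I had to build my own example: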

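import re
import phpserialize     # requires: pip install phpserialize

source = """
a: 2: {
i: 0;
s: 3: "foo";
i: 1;
s: 4: "quux";
}
"""

# Keep double-quoted strings intact (group 1) and delete all other whitespace.
cleanup_re = re.compile(r'(".*?")|\s+')
clean_source = cleanup_re.sub(lambda m: m.group(0) if m.group(1) else "", source)
# clean_source is now: a:2:{i:0;s:3:"foo";i:1;s:4:"quux";}
data = phpserialize.loads(clean_source.encode('utf8'))
# data maps integer keys to byte strings: {0: b'foo', 1: b'quux'}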

This only works if none of the strings contain a double quote. I can't think of a way to handle that case without writing a proper parser.
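For reference, such a parser does not have to be large. The following is a rough sketch, not a replacement for phpserialize: the names LooseParser and loads_loose are hypothetical, it only handles the i, s and a types seen in the question, and it assumes the declared string lengths are correct byte counts. Because it reads each string by its declared length instead of searching for the closing quote, embedded double quotes are not a problem, while whitespace between tokens is skipped.

```python
import re

_WS = re.compile(rb"\s*")
_INT = re.compile(rb"-?\d+")


class LooseParser:
    """Hypothetical helper: parses i/s/a values, ignoring whitespace between tokens."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def _skip_ws(self):
        self.pos = _WS.match(self.data, self.pos).end()

    def _expect(self, token: bytes, skip: bool = True):
        if skip:
            self._skip_ws()
        if not self.data.startswith(token, self.pos):
            raise ValueError(f"expected {token!r} at offset {self.pos}")
        self.pos += len(token)

    def _int(self) -> int:
        self._skip_ws()
        m = _INT.match(self.data, self.pos)
        if m is None:
            raise ValueError(f"expected integer at offset {self.pos}")
        self.pos = m.end()
        return int(m.group())

    def value(self):
        self._skip_ws()
        kind = self.data[self.pos:self.pos + 1]
        if kind not in (b"i", b"s", b"a"):
            raise ValueError(f"unsupported type {kind!r} at offset {self.pos}")
        self.pos += 1
        self._expect(b":")
        if kind == b"i":
            number = self._int()
            self._expect(b";")
            return number
        if kind == b"s":
            length = self._int()
            self._expect(b":")
            self._expect(b'"')
            # Read exactly `length` bytes, so quotes inside the string are harmless.
            raw = self.data[self.pos:self.pos + length]
            self.pos += length
            # No whitespace skipping here: a wrong length must fail loudly.
            self._expect(b'"', skip=False)
            self._expect(b";")
            return raw.decode("utf-8")
        # kind == b"a": `count` key/value pairs between braces
        count = self._int()
        self._expect(b":")
        self._expect(b"{")
        result = {}
        for _ in range(count):
            key = self.value()
            result[key] = self.value()
        self._expect(b"}")
        return result


def loads_loose(text: str):
    """Parse one value from whitespace-padded PHP-serialize-like text."""
    return LooseParser(text.encode("utf-8")).value()


sample = 'a: 2: {\n i: 0;\n s: 3: "foo";\n i: 1;\n s: 9: "say "hi"!";\n}\n'
print(loads_loose(sample))  # {0: 'foo', 1: 'say "hi"!'}
```

Note that input whose lengths were edited by hand, such as s: 3: "somestuff" in the question, will raise a ValueError here too, since the declared length is the only thing that tells the parser where a string ends.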
