Strip HTML, inline styles, and newline/tab characters from JSON in Python



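I have an external JSON file that contains inline styling, HTML markup, and \n and \t characters. I want to remove all of that and keep only the plain strings, without breaking the JSON format. I have tried the code below and looked at many solutions, but nothing has worked so far. Thank you very much for your time. Here is my code.

I am using Python 3.x.x.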

import json, re
from html.parser import HTMLParser
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
with open('project-closedtasks-avgdaysopen.json') as f:
    data = json.load(f)
    data = json.dumps(data, indent=4)
print(data)

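Note that this is the file I am reading (it is in the same folder as the script). I want the same output, but with no HTML tags, no inline styles, no \n, and nothing else: only the strings.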

[
    {
        "idrfi" : 36809,
        "fkproject" : 33235,
        "subject" : "M2 - Flashing Clarifications",
        "description" : "<ol style=\"margin-left:0.375in\">\n\t<li><span style=\"font-family:calibri; font-size:11pt\">Refer to detail 5/A650 attached. Can the pre-finished metal panel be swapped for pre-finished metal flashing? This will allow the full assembly to be installed by the mechanical HVAC trade vs requiring the cladding trade to return for penthouse work. </span></li>\n</ol>\n",
        "response" : null
    },
    {
        "idrfi" : 36808,
        "fkproject" : 33139,
        "subject" : "M1 - Flashing Clarifications",
        "description" : "<ol style=\"margin-left:0.2in\">\n\t<li><span style=\"font-family:calibri; font-size:11pt\">Refer to detail 6/A612 attached. Clarify location of flashing on detail.</span></li>\n\t<li><span style=\"font-family:calibri; font-size:11pt\">Refer to details 2,4/A614 attached. Clarify location of flashing on detail. </span></li>\n\t<li><span style=\"font-family:calibri; font-size:11pt\">Refer to detail 3/A616 attached. Clarify location of flashing on detail.</span></li>\n\t<li><span style=\"font-family:calibri; font-size:11pt\">Refer to detail 5/A650 attached. Can the pre-finished metal panel be swapped for pre-finished metal flashing? This will allow the full assembly to be installed by the mechanical HVAC trade vs requiring the cladding trade to return for penthouse work. </span></li>\n</ol>\n",
        "response" : null
    }
]

I found this function, but I don't know how to use it:

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

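Edit: after implementing this, &nbsp;, \n, \t and similar characters are still not removed. I only want the plain strings: no tags, no styles, nothing else.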

import json, re
from html.parser import HTMLParser
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
with open('project-closedtasks-avgdaysopen.json') as f:
    data = json.load(f)
    data = json.dumps(data, indent=4)
    removed_tags = remove_html_tags(data)
print(removed_tags)

Just call the function you wrote:

import json, re
from html.parser import HTMLParser
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
with open('project-closedtasks-avgdaysopen.json') as f:
    data = json.load(f)
    data = json.dumps(data, indent=4)
    removed_tags = remove_html_tags(data)
print(removed_tags)

I checked it, and it works fine.
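One possible reason the \n, \t and &nbsp; sequences survive is that the regex runs on the text produced by json.dumps, where a newline inside a string has already been re-escaped as the two characters backslash and n, and the pattern only targets tags anyway. A minimal sketch of an alternative, assuming it is acceptable to clean each string value after json.load and before json.dumps, is shown here. The names `_TextExtractor`, `strip_markup` and `clean` are hypothetical helpers made up for this illustration, and the sample data is a shortened stand-in for the real file.

```python
import json
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Drop tags (and their style attributes), then collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flushes any text still buffered by the parser
    plain = "".join(parser.parts)
    # \s covers real newlines, tabs and the non-breaking space from &nbsp;
    return re.sub(r"\s+", " ", plain).strip()


def clean(value):
    """Recursively clean every string inside parsed JSON data."""
    if isinstance(value, str):
        return strip_markup(value)
    if isinstance(value, list):
        return [clean(item) for item in value]
    if isinstance(value, dict):
        return {key: clean(item) for key, item in value.items()}
    return value  # numbers, booleans and null pass through unchanged


# Shortened stand-in for the contents of the JSON file (an assumption).
sample = json.loads(
    '[{"idrfi": 36808, "description": "<ol style=\\"margin-left:0.2in\\">'
    '\\n\\t<li><span style=\\"font-size:11pt\\">Clarify location of flashing'
    '&nbsp;on detail.</span></li>\\n</ol>\\n", "response": null}]'
)
print(json.dumps(clean(sample), indent=4))
```

For the real file, the same `clean` call could be applied to the result of json.load(f). Because the cleanup happens on the parsed Python objects, json.dumps still produces valid JSON afterwards. Note that this sketch also collapses the whitespace between list items into single spaces, which may or may not be what you want.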
