Removing line breaks in an HTML file



I have an HTML file, and I need to remove all line breaks between the body tags:

<HTML>
  <HEAD>
    <TITLE>
    </TITLE>
  </HEAD>
<BODY>
  <P></P>
  <P></P>
</BODY>
</HTML>

to get this:

<HTML>
  <HEAD>
    <TITLE>
    </TITLE>
  </HEAD>
<BODY><P></P><P></P></BODY>
</HTML>

Try reading the whole HTML into a string and doing this:

# Slice out <BODY>...</BODY>; the +7 is len('</BODY>'), so the closing tag is included
bodystring = htmlstring[htmlstring.index('<BODY>'):htmlstring.index('</BODY>')+7]
htmlstring = htmlstring.replace(bodystring, bodystring.replace('\n',''))

Or, reading the file and writing the result back:

with open('name.html', 'r') as f:
    file_content = f.read()
start_index, end_index = file_content.index("<BODY>"), file_content.index("</BODY>")
head, body_content, tail = file_content[:start_index], file_content[start_index:end_index], file_content[end_index:]
# Only the body slice loses its newlines; head and tail are left as they were
new_html = head + body_content.replace("\n", "") + tail
with open('name.html', 'w') as f:
    f.write(new_html)

This one is a bit homemade and uses no external libraries (assuming your file is foo.html):

with open('foo.html') as f:
    html_file = f.readlines()
# Indices of the lines holding the opening and closing BODY tags
body_index = [i for i, line in enumerate(html_file) if 'BODY' in line]
start, end = body_index
# Start at the <BODY> line itself so that its trailing newline is removed too
for i in range(start, end):
    html_file[i] = html_file[i].replace('\n', '')
new_html = ''.join(html_file)

Done.
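
Note that the snippets above remove only the newline characters, so the indentation in front of each <P> stays in the output. If the goal is exactly the one-line body shown in the question, a regex-based variant might look like the sketch below. The function name collapse_body and the sample string are made up for illustration, and it assumes the file has a single body element and nothing inside it (such as a <PRE> block or text between inline tags) where whitespace matters.

```python
import re

def collapse_body(html: str) -> str:
    """Remove line breaks, and the indentation around them, inside the body element."""
    # Match the body element case-insensitively; DOTALL lets '.' span newlines.
    match = re.search(r"<body\b[^>]*>.*?</body>", html, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return html  # no body element found; return the input unchanged
    # Drop each newline together with any whitespace directly around it.
    collapsed = re.sub(r"\s*\r?\n\s*", "", match.group(0))
    return html[:match.start()] + collapsed + html[match.end():]

# Hypothetical sample input, shortened from the question's file
sample = "<HTML>\n<BODY>\n  <P></P>\n  <P></P>\n</BODY>\n</HTML>\n"
print(collapse_body(sample))
```

Everything outside the body element is returned as it was, so the HEAD section keeps its line breaks.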
