How do I remove unwanted html tags from a text file?



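I have a txt file containing some random emails. My script pulls emails into this text file several times a day, and each email is wrapped in <start><end> to mark where it begins and ends. I want to clean up my file and remove the unwanted parts, which are mostly html tags, keeping only the body text of each email as a separate line in the txt file. What is the best way to strip the html tags from my file so that only the strings contained in the body tag remain?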

There is also one type of email that has an Id attribute; I am not sure how to keep it alongside the email body string (see the first line of output.txt).

myTxt.txt:

<start><html> <head>    <title>A random title</title>   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> </head> <body> Hello World, Have a great day  Thanks <br><br> <hr> <br> <b>Details:</b><br><br> Name: John Doe<br><br> Email: johndoe@gmail.com<br><br> Secondary Name: Joe<br><br> Reference URL: <a href="https://some-url.com/Id=03415681&returnUrl=%2Fui%2F2%2Femail%2Faccount%3Ffind%3D" style="text-decoration: none; color: #08c;">/ui/2/email/account?Id=03415681&returnUrl=%2Fui%2F2%2Femail%2Faccount%3Ffind%3D</a><br> </body> <img src='https://path/to/img.gif?v=RL9lKY7Jm6AY0Gc3tHa9'/> </html> <end>
<start><div>Hello World, How are you?    Best.</div> <end>
<start>Hello World.<end>
<start>Hello World, this is my message.

Regards,
Jane
www.url.com
<end>
<start><html xmlns:o="urn:schemas-microsoft-com:office:office1" xmlns:w="urn:schemas-microsoft-com:office:word1" xmlns:m="http://schemas.microsoft.com/office/2004/121/omml" xmlns="http://www.w3.org/TR/REC-html401"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!-- /* Font Definitions */ @font-face   {font-family:"Cambria Math";    panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face   {font-family:DengXian;  panose-1:2 1 6 0 3 1 1 1 1 1;} @font-face   {font-family:Calibri;   panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face  {font-family:"@DengXian";  panose-1:2 1 6 0 3 1 1 1 1 1;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal     {margin:0cm;    margin-bottom:.0001pt;  font-size:11.0pt;   font-family:"Calibri",sans-serif;} a:link, span.MsoHyperlink    {mso-style-priority:99;     color:blue;     text-decoration:underline;} .MsoChpDefault  {mso-style-type:export-only;} @page WordSection1    {size:612.0pt 792.0pt;  margin:72.0pt 72.0pt 72.0pt 72.0pt;} div.WordSection1   {page:WordSection1;} --></style></head><body lang=EN-MY link=blue vlink="#954F72"><div class=WordSection1><p class=MsoNormal>Hello World, </p><p class=MsoNormal></p><p class=MsoNormal>This is my message. </p><p class=MsoNormal></p><p class=MsoNormal>Please reply when you can. </p><p class=MsoNormal></p><p class=MsoNormal>Thank you.<br>John</p><p class=MsoNormal>Sent from <a href="https://go.microsoft.com/fwlink/?LinkId=1234567890">Mail</a> for Windows 10</p><p class=MsoNormal><o:p> </o:p></p></div></body></html> <end>

Expected output.txt:

Hello World, Have a great day Thanks Id=0341568115681
Hello World, How are you? Best.
Hello World.
Hello World, this is my message. Regards, Jane www.url.com
Hello World, Please reply when you can. Thank you.John Sent from Mailfor Windows 10

My code so far:

# Wrap each email in <start>/<end> tags to clearly separate emails, then save them to a file.
# The 'emails' variable below contains all the emails captured when the script ran.
file = 'path/to/myTxt.txt'
start = '<start>'
end = '<end>'
toTXT = [start + s + end for s in emails]
with open(file, 'w') as f:
    f.write("\n".join(map(str, toTXT)))

Can anyone help? Thanks a lot in advance!

This seems to work:

>>> a = '''<start><html> <head>    <title>A random title</title>   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> </head> <body> Hello World, Have a great day  Thanks <br><br> <hr> <br> <b>Details:</b><br><br> Name: John Doe<br><br> Email: johndoe@gmail.com<br><br> Secondary Name: Joe<br><br> Reference URL: <a href="https://some-url.com/Id=03415681&returnUrl=%2Fui%2F2%2Femail%2Faccount%3Ffind%3D" style="text-decoration: none; color: #08c;">/ui/2/email/account?Id=03415681&returnUrl=%2Fui%2F2%2Femail%2Faccount%3Ffind%3D</a><br> </body> <img src='https://path/to/img.gif?v=RL9lKY7Jm6AY0Gc3tHa9'/> </html> <end>
... <start><div>Hello World, How are you?    Best.</div> <end>
... <start>Hello World.<end>
... <start>Hello World, this is my message.
... '''
>>> import re
>>> print(' '.join(i.strip(' ') for i in re.split(r'<[^>]+>', a) if i.strip(' ')))
A random title Hello World, Have a great day  Thanks Details: Name: John Doe Email: johndoe@gmail.com Secondary Name: Joe Reference URL: /ui/2/email/account?Id=03415681&returnUrl=%2Fui%2F2%2Femail%2Faccount%3Ffind%3D 
Hello World, How are you?    Best. 
Hello World. 
Hello World, this is my message.
>>> 
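
The one-liner above strips tags from the whole file at once, so line breaks inside a single email survive and the emails are not separated by their markers. Below is a minimal sketch of applying the same idea per email, assuming every message is wrapped in <start>...<end> as in the question. The helper name emails_to_lines is hypothetical, and text inside elements such as <title> is still kept, because the regex only removes the tags themselves.

```python
import re

# Matches one email, including embedded newlines, between the <start> and <end> markers.
EMAIL_RE = re.compile(r'<start>(.*?)<end>', re.DOTALL)
# Matches any single tag such as <br> or <a href="...">.
TAG_RE = re.compile(r'<[^>]+>')


def emails_to_lines(raw):
    """Return one cleaned line of text per <start>...<end> block (hypothetical helper)."""
    lines = []
    for body in EMAIL_RE.findall(raw):
        text = TAG_RE.sub(' ', body)    # replace each tag with a space
        text = ' '.join(text.split())   # collapse runs of spaces and newlines
        if text:
            lines.append(text)
    return lines


sample = (
    "<start><div>Hello World, How are you?    Best.</div> <end>\n"
    "<start>Hello World.<end>\n"
    "<start>Hello World, this is my message.\n\nRegards,\nJane\nwww.url.com\n<end>\n"
)
for line in emails_to_lines(sample):
    print(line)
```

Writing the result back is then a matter of joining the returned list with "\n" and saving it to the output file.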

You can use this approach.

import re

def striphtml(data):
    # Non-greedy match of anything between '<' and '>' on a single line.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

print(striphtml("<h2>Some text</h2>"))

Output: Some text
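
Note that a regex like this only deletes the tags themselves, so text inside elements such as <title> is left behind. Here is a hedged sketch of an alternative using the standard library's html.parser, which can skip the content of chosen elements. The class name BodyText, the helper body_text, and the set of skipped tags are assumptions for illustration, not part of the answer above.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect text nodes, ignoring anything inside the SKIP elements."""

    SKIP = {'head', 'title', 'style', 'script'}  # assumed set of unwanted elements

    def __init__(self):
        super().__init__()
        self.parts = []
        self.depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def body_text(html):
    """Return the visible text of an HTML fragment as a single line (hypothetical helper)."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.parts).split())


print(body_text("<html><head><title>A random title</title></head>"
                "<body>Hello World,<br>Have a great day</body></html>"))
```

Plain-text emails with no tags at all pass through unchanged apart from whitespace being collapsed, so the same helper could be applied to every <start>...<end> block.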
