Cleaning non-body text out of HTML



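I want to extract the text of emails stored in some HTML files. Sometimes the HTML contains extra information, such as CSS styles. Here is an example file: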

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3-html40"><head><!-- Template generated by Exclaimer Mail Disclaimers on 09:48:42 Donnerstag, 2 Mai 2019 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">P.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
LI.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
DIV.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
TABLE.ImprintUniqueIDTable {
MARGIN: 0cm 0cm 0pt
}
DIV.Section1 {
page: Section1
}
</style>
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v:* {behavior:url(#default#VML);}
o:* {behavior:url(#default#VML);}
w:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p
{mso-style-priority:99;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
span.E-MailFormatvorlage18
{mso-style-type:personal-compose;
font-family:"Arial",sans-serif;
color:black;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="DE" link="#0563C1" vlink="#954F72">
<p class="ImprintUniqueID"></p>
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%;border-collapse:collapse">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">i.A.
<b>mdfdddfdf</b>&nbsp;<b>fdfdfd</b><o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:#7D7D7D">Euf<o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&nbsp;<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:18.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:18.75pt">
<p class="MsoNormal"><a href="randomURL" target="''"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:blue;text-decoration:none"><img border="0" width="173" height="25" id="_x0000_i1025" src="cid:image001.jpg@01D500CC.38133950" alt="RANDOM-LOGO"></span></a><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p></o:p></span></p>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr style="height:39.0pt">
<td style="padding:0cm 0cm 0cm 0cm;height:39.0pt">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">OLRAIT ....<br>
a name street Strasse<br>
51766 somewhere in , Germany<o:p></o:p></span></p>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td nowrap="" style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">Tel:<o:p></o:p></span></p>
</td>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&#43;another number<o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">my number</span><span style="color:black"><o:p></o:p></span></p>
</td>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">&#43;a number</span><span style="color:black"><o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<table class="MsoNormalTable" border="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:.75pt .75pt .75pt .75pt"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black;display:none"><o:p>&nbsp;</o:p></span></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="padding:0cm 0cm 0cm 0cm"></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr style="height:3.75pt">
<td style="padding:0cm 0cm 0cm 0cm;height:3.75pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><a href="mailto:randomEMAIL@randomEMAIL.com" title="Click to send email to randomEMAIL"><span style="color:blue"> randomEMAIL </span></a><o:p></o:p></span></p>
</td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><a href="randomURL" title=""><span style="color:blue">www.fawema.com</span></a><o:p></o:p></span></p>
</td>
</tr>
<tr style="height:7.5pt">
<td style="padding:0cm 0cm 0cm 0cm;height:7.5pt"></td>
</tr>
<tr>
<td style="padding:0cm 0cm 0cm 0cm">
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black">Geschäftsführer: someNAMESr<br>
Handelsregister: randomURL 71761<br>
<br>
Bitte beachten Sie unsere Hinweise zum Datenschutz unter: <a href="randomURLIzo6I7nmlDCWwpK2F8C-adMC8Us" title="">
<span style="color:blue">www.randomURL.com</span></a><br>
Please find our information about data protection on: <a href="https://randomURL.com/index.php?atp_str=GD7FHAaZldBtYu4ZbiuQ5j0ju1Bz3V_-WJVhfSIvwKpNc7PkjwxvXWJ9N1ZYj4wxICa635o8b7ZYcrVXOGSir15tnxi2soe_ByWg05vb9Nx5D7wE08-DCfJ0za-gv6SH3MYY3OGuT5-ZO-eXZ1T5GbdEbyr5OE5_ofzIU4fCytSlKwS7OVZ6MrqVaMfXfc1AHnwigCkcGUgERcuUj8guuA8BY3huRL1aHmjQWKi1uHwr4CfaTN2qQVZhD9WLXQiuNEItrlQyjzk_NrekGaVk2lhC5JkeAamHHgtEQkvrXEBHVCM6OiMxYjU0YTA3ZDVhMWIjOjojqVc_-neMexkb6m2m7TQMYw" title="">
<span style="color:blue">www.randomURL.com</span></a><o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<p></p>
<p class="ImprintUniqueID">&nbsp;</p>
<font size="1" face="Arial">
<hr>
</font>
<p class="ImprintUniqueID"><br>
<font size="1" face="Arial">Diese E-Mail kann vertrauliche und/oder rechtlich geschützte Informationen enthalten. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten
Sie diese E-Mail und alle enthaltenen Anhänge. Das Öffnen der Anhänge, das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser E-Mail und des Anhanges sind nicht gestattet.<br>
<br>
</font><font size="1" face="Arial"></p>
<hr>
</font><font face="Arial"><br>
<font size="1">This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure
or distribution of the material in this e-mail is strictly forbidden.</font></font>
<p></p>
</body>
</html>


I read the text with Beautiful Soup using the following code:

import codecs
from bs4 import BeautifulSoup
f = codecs.open(file, 'rb')
document = BeautifulSoup(f.read().decode('utf-8', 'ignore')).get_text().strip()

After that, I cleaned it with bleach:

document = bleach.clean(document, strip=True)

However, it does not remove this CSS style text:

<style type="text/css">P.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
LI.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
DIV.ImprintUniqueID {
MARGIN: 0cm 0cm 0pt
}
TABLE.ImprintUniqueIDTable {
MARGIN: 0cm 0cm 0pt
}
DIV.Section1 {
page: Section1
}
</style>

I tried to clean it up with a regex, but that does not work:

regex = '(?s)<style>(.*?)</style>'
pattern = re.compile(regex)
document_clean = re.sub(pattern, '', document)

Any ideas?

Note that the <style> tag has a type attribute, i.e.:

<style type="text/css">P.ImprintUniqueID {

So your regex needs to be a little more lenient. For example:

regex = r'(?s)<style[>\s](.*?)</style>'

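This makes sure you match a style tag, with or without attributes, and not some other tag whose name merely starts with style, such as <style2> (I made that tag name up).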
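
A minimal sketch of this approach, assuming the character class is meant to be `[>\s]` (a `>` or a whitespace character right after `<style`). The sample string is hypothetical and much shorter than the e-mail in the question; it only illustrates which tags the pattern removes and which it leaves alone.

```python
import re

# "<style" must be followed by ">" or whitespace, so attributes such as
# type="text/css" are allowed, but longer tag names like <style2> are not.
STYLE_BLOCK = re.compile(r'(?s)<style[>\s].*?</style>')

# Hypothetical sample input, not taken from the original e-mail.
sample = (
    '<style type="text/css">P.ImprintUniqueID {\nMARGIN: 0cm 0cm 0pt\n}\n</style>'
    '<style>p {color: red}</style>'
    '<style2>kept</style2>'
    '<p>Hello</p>'
)

# Remove every style block, including those that span several lines.
cleaned = STYLE_BLOCK.sub('', sample)
print(cleaned)  # prints: <style2>kept</style2><p>Hello</p>
```

The pattern is case-sensitive as written; if the input may contain `<STYLE>`, passing `re.IGNORECASE` to `re.compile` is one way to cover that. Note that the substitution has to run on the raw HTML, before the tags are stripped by anything else.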
