Cleaning up HTML content with Python



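I'm using an external API that sends me text taken from HTML emails. The text arrives without the outer HTML structure (e.g. <html>...</html> and so on). I need to clean up this text and output it to Slack. I tried both BeautifulSoup and Bleach, and neither worked, probably because the input is only a fragment of HTML.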

A sample of the input text looks like this:

&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.&lt;/div&gt;

For the input above, I want the following output:

Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.
Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.
Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.
Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.

I used the following simple Bleach routine:

import bleach

def textify(html):
    text = bleach.clean(html)
    return text

With BeautifulSoup, I also used a few regular expressions to clean up the output:

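import re
from bs4 import BeautifulSoup

def textify(html):
    # Turn <br> tags into newlines before parsing
    html = re.sub('<br>', '\n', html)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Manually decode a few HTML entities
    text = re.sub(r'&lt;', '<', text)
    text = re.sub(r'&gt;', '>', text)
    text = re.sub(r'&#39;', "'", text)
    return text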

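The input is entity-escaped, so neither library sees any tags in it. Before passing the strings to Bleach or Beautiful Soup, you first need to unescape them with the standard library's html module: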

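import bleach
from bs4 import BeautifulSoup
from html import unescape

html = "&lt;div style=&#39;bo...div&gt;"
unescaped_html = unescape(html)  # entities become real < > ' " characters
text = bleach.clean(unescaped_html)
soup = BeautifulSoup(unescaped_html, 'html.parser')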
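
To show the unescape-then-parse order end to end, here is a minimal sketch that relies only on the standard library, so it swaps in html.parser.HTMLParser where the snippet above uses Bleach or Beautiful Soup. The class name _TextExtractor, the choice to start a new line at each closing div or p (and at each br), and the whitespace tidying are assumptions made for illustration, not part of the original answer.

```python
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and adds line breaks
    after block-level closing tags and at <br>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Assumption: each </div> or </p> ends one paragraph of output.
        if tag in ("div", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def textify(escaped_html):
    # Step 1: turn &lt;div ...&gt; back into <div ...> so the parser sees tags.
    markup = unescape(escaped_html)
    # Step 2: parse the fragment and keep only the text nodes.
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()
    text = "".join(extractor.parts)
    # Step 3: strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Calling textify on the sample input should yield one plain-text line per div, which can then be sent to Slack. If Beautiful Soup is available, BeautifulSoup(unescape(s), 'html.parser').get_text() follows the same two steps with less code.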
