Extract the original email text from HTML tags



I have 30B rows. My DataFrame looks like this:

age                          email
33    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">. 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
family:&quot;Calibri&quot;,sans-serif; color:black">Iam not interested. 
Please unsubscribe me.&nbsp;</span></p><pclass="MsoNormal">
<spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">&nbsp;
22    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
family:&quot;Calibri&quot;,sans-serif;color:black">Please share company 
details</span></p><divclass="MsoNormal" align="center"style="text- 
align:center"><hr size="2"width="98%" align="center"></div> 
<pclass="MsoNormal">
43    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72"> 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font- 
family:&quot;Calibri&quot;,sans-serif;color:black">Can you send 
some project info for west region ofIndia</span></p><p class="MsoNormal"> 
<spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">
38    </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div 
class="WordSection1"><pclass="MsoNormal"><span style="font- 
family:&quot;Calibri&quot;,sans-serif;color:black">Price of Mono perc</span> 
</p><divclass="MsoNormal" align="center"style="text-align:center"><hr 
size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>

My final DataFrame should look like this:

age                          email                                                   text
33    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">.      Iam not interested. 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-        Please unsubscribe
family:&quot;Calibri&quot;,sans-serif; color:black">Iam not interested. me.
Please unsubscribe me.&nbsp;</span></p><pclass="MsoNormal">
<spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">&nbsp;
22    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">         Please share 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-          company details
family:&quot;Calibri&quot;,sans-serif;color:black">Please share company 
details</span></p><divclass="MsoNormal" align="center"style="text- 
align:center"><hr size="2"width="98%" align="center"></div> 
<pclass="MsoNormal">
43    </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">           Can you send 
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-            some project 
family:&quot;Calibri&quot;,sans-serif;color:black">Can you send            info for west 
some project info for west region ofIndia</span></p><p class="MsoNormal">  region ofIndia
<spanstyle="font-family:&quot;Calibri&quot;,sans-serif;color:black">
38    </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div         Price of Mono
class="WordSection1"><pclass="MsoNormal"><span style="font-                 perc
family:&quot;Calibri&quot;,sans-serif;color:black">Price of Mono perc</span> 
</p><divclass="MsoNormal" align="center"style="text-align:center"><hr 
size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>

My code looks like this:

word1 = "sans-serif; color:black">"
word2 = "</span></p>"
df['text'] = s.split(word1)[1].split(word2)[0]

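This is supposed to return the text between word1 and word2, but it currently does not work. My logic is to extract the email body (the actual message) from the HTML, where the message sits between word1 and word2.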
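A few things in the snippet above would stop it from running as written: the literal for word1 contains an unescaped double quote inside a double-quoted string, `s` is never defined, and the delimiter appears in the sample rows both with and without a space after the semicolon. The following is only a minimal sketch of the same split idea under those observations; the function name `text_between`, the tuple of delimiter variants, and the choice to stop at `</span>` rather than `</span></p>` (row 38 has whitespace between the two tags) are assumptions, not part of the original code.

```python
# Both spellings of the start delimiter seen in the sample rows (assumption:
# these two variants are the only ones that occur in the data).
WORD1_VARIANTS = ('sans-serif; color:black">', 'sans-serif;color:black">')
# Stop at the closing span; some rows have whitespace between </span> and </p>.
WORD2 = "</span>"


def text_between(s):
    """Return the text between the start delimiter and WORD2, or "" if absent."""
    for word1 in WORD1_VARIANTS:
        _, found, rest = s.partition(word1)
        if found:
            # Keep everything up to the first closing span after the delimiter.
            return rest.partition(WORD2)[0]
    return ""
```

Applied per row, this would be something like `df['text'] = df['email'].apply(text_between)`. Note that HTML entities such as `&nbsp;` are left in place by this approach.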

Parse the HTML with BeautifulSoup.

For example:

from bs4 import BeautifulSoup
df['text'] = df['email'].apply(lambda x: BeautifulSoup(x, "html.parser").find("p", class_="MsoNormal").text)
print(df)

Output:

0        Iam not interested. \nPlease unsubscribe me. 
1                       Please share company \ndetails
2    Can you send \nsome project info for west regi...
3                                 Price of Mono perc\n
Name: text, dtype: object

Edit based on the comments:

def getText(val):
    soup = BeautifulSoup(val, "html.parser")
    try:
        return soup.find("p", class_="MsoNormal").text
    except AttributeError:
        # find() returned None: this value has no <p class="MsoNormal">
        return ""

df['text'] = df['email'].apply(getText)
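
The extracted text in the output above still carries newlines from the wrapped HTML source, and `&nbsp;` is decoded to a non-breaking space. If a single clean line per row is wanted, a small post-processing step could look like the sketch below; the helper name `clean_text` is hypothetical and not part of the original answer.

```python
def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim both ends.

    str.split() with no arguments splits on any Unicode whitespace, which
    includes newlines and the non-breaking space that &nbsp; decodes to.
    """
    return " ".join(text.split())
```

It could be chained after the extraction, for example `df['text'] = df['email'].apply(getText).apply(clean_text)`.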

Try selecting it with a regular expression:

sans-serif; color:black">(.*)</span></p>

regex101 example link: https://regex101.com/r/oPihDO/4
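
As written, the pattern requires a space after the semicolon, `(.*)` is greedy, and `.` does not match newlines by default, so it may miss rows where the message wraps across lines or where the style has no space. Below is a hedged sketch of how a more tolerant variant might be applied in Python; the pattern changes (`\s*`, non-greedy group, `re.DOTALL`, stopping at `</span>`) and the name `extract_with_regex` are assumptions layered on top of the expression above.

```python
import html
import re

# Tolerate optional whitespace after the semicolon, let the capture span
# newlines (DOTALL), and stop at the first closing span (non-greedy).
PATTERN = re.compile(r'sans-serif;\s*color:black">(.*?)</span>', re.DOTALL)


def extract_with_regex(raw):
    """Return the first captured message with HTML entities decoded, or ""."""
    match = PATTERN.search(raw)
    if match is None:
        return ""
    # html.unescape turns &nbsp; into a non-breaking space, which strip() removes.
    return html.unescape(match.group(1)).strip()
```

With a DataFrame this could be used as `df['text'] = df['email'].apply(extract_with_regex)`. Regular expressions are brittle against arbitrary HTML, so this is only reasonable when the markup is as uniform as in the samples.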
