I have 30B rows. My DataFrame looks like this:
age email
33 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">.
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif; color:black">Iam not interested.
Please unsubscribe me. </span></p><pclass="MsoNormal">
<spanstyle="font-family:"Calibri",sans-serif;color:black">
22 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif;color:black">Please share company
details</span></p><divclass="MsoNormal" align="center"style="text-
align:center"><hr size="2"width="98%" align="center"></div>
<pclass="MsoNormal">
43 </style></head><body lang="EN-IN"link="#0563C1" vlink="#954F72">
<divclass="WordSection1"><p class="MsoNormal"><spanstyle="font-
family:"Calibri",sans-serif;color:black">Can you send
some project info for west region ofIndia</span></p><p class="MsoNormal">
<spanstyle="font-family:"Calibri",sans-serif;color:black">
38 </style></head><bodylang="EN-IN" link="#0563C1"vlink="#954F72"><div
class="WordSection1"><pclass="MsoNormal"><span style="font-
family:"Calibri",sans-serif;color:black">Price of Mono perc</span>
</p><divclass="MsoNormal" align="center"style="text-align:center"><hr
size="2"width="98%" align="center"></div><pclass="MsoNormal"><b>
My final DataFrame should look like this:
age  email                                                 text
33   </style></head><body lang="EN-IN"link="#0563C1" ...   Iam not interested. Please unsubscribe me.
22   </style></head><body lang="EN-IN"link="#0563C1" ...   Please share company details
43   </style></head><body lang="EN-IN"link="#0563C1" ...   Can you send some project info for west region ofIndia
38   </style></head><bodylang="EN-IN" link="#0563C1"v...   Price of Mono perc
(The email column is unchanged from the input above and is truncated here for readability.)
My code looks like this:
word1 = 'sans-serif; color:black">'
word2 = "</span></p>"
df['text'] = s.split(word1)[1].split(word2)[0]
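This should return the text between word1 and word2, but it currently does not work. My logic is to extract the email body or message from the text, where the body lies between word1 and word2.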
Use BeautifulSoup to parse the HTML.
For example:
from bs4 import BeautifulSoup
df['text'] = df['email'].apply(lambda x: BeautifulSoup(x, "html.parser").find("p", class_="MsoNormal").text)
print(df)
Output:
0    Iam not interested. \nPlease unsubscribe me.
1    Please share company \ndetails
2    Can you send \nsome project info for west regi...
3    Price of Mono perc\n
Name: text, dtype: object
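The extracted strings still carry the line breaks from the HTML source, while the desired text column in the question has none. A minimal sketch of one way to tidy them up, assuming that collapsing all whitespace runs into single spaces is acceptable (normalize_whitespace is a hypothetical helper name, not part of the answer above):

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing whitespace, so joining with " " normalizes the string.
    return " ".join(text.split())


raw = "Iam not interested. \nPlease unsubscribe me. "
print(normalize_whitespace(raw))
```

On the DataFrame this could be applied as df['text'] = df['text'].apply(normalize_whitespace).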
Edit, based on the comments:
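def getText(val):
    # Return the text of the first <p class="MsoNormal">, or "" if the
    # value cannot be parsed or no such tag exists.
    try:
        soup = BeautifulSoup(val, "html.parser")
        return soup.find("p", class_="MsoNormal").text
    except Exception:
        # find() returned None (no match) or val was not a string
        return ""

df['text'] = df['email'].apply(getText)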
Try selecting the text with a regular expression:
sans-serif; color:black">(.*)</span></p>
regex101 example link: https://regex101.com/r/oPihDO/4
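As a rough sketch of how such a pattern could be applied in Python: the sample rows write the style both as "sans-serif; color" and "sans-serif;color", and the message can span several lines, so the variant below allows optional whitespace and uses re.DOTALL with a non-greedy group. This variant, the name extract_body, and the sample string (modelled on the second row, with spacing normalized) are assumptions for illustration, not the exact regex from the link.

```python
import re

# Assumed variant of the pattern above:
#   \s*       tolerates "sans-serif; color" and "sans-serif;color"
#   (.*?)     non-greedy, stops at the first closing </span></p>
#   re.DOTALL lets "." match line breaks inside the message
BODY_RE = re.compile(r'sans-serif;\s*color:black">(.*?)</span>\s*</p>', re.DOTALL)


def extract_body(html):
    """Return the text between the styled <span> and </span></p>, or ""."""
    if not isinstance(html, str):
        return ""  # e.g. missing values in the column
    match = BODY_RE.search(html)
    if match is None:
        return ""
    # Collapse line breaks and repeated spaces left over from the HTML.
    return " ".join(match.group(1).split())


sample = (
    '</style></head><body lang="EN-IN" link="#0563C1" vlink="#954F72">\n'
    '<div class="WordSection1"><p class="MsoNormal"><span style="font-\n'
    'family:"Calibri",sans-serif;color:black">Please share company\n'
    'details</span></p><div class="MsoNormal" align="center">'
)
print(extract_body(sample))
```

It could then be applied row by row with df['text'] = df['email'].apply(extract_body). A regex like this depends on the exact inline style string, so it is more brittle than parsing the HTML.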