How can I scrape WhatsApp emojis using BeautifulSoup?



I know how to scrape emojis from WhatsApp, but only when:

  1. the message is a single emoji with no text, or
  2. the message contains text together with emojis.

But when a message contains two emojis and no text, I can't get both of them. Here is the HTML of the message "🎂":

<div class="JwMbj i0jNr selectable-text copyable-text">
<span class="_3R6rC">
<img crossorigin="anonymous"
src="/img/d07f9aca6938f691b840f97dd1cd67dd_w_638-64.png" alt="🎂" draggable="false"
class="_2UdhN _1xeoG i0jNr selectable-text copyable-text" data-plain-text="🎂"
style="visibility: visible;">
</span>
</div>

And I tried this code to get the emoji:

from bs4 import NavigableString, Tag

# s is the BeautifulSoup object for the page
m = s.find('div', attrs={'class': 'i0jNr'})
v = m.find('span', attrs={'class': '_3R6rC'})
for i in v.children:
    if isinstance(i, NavigableString):
        print(i)
    elif isinstance(i, Tag):
        print(i.attrs['alt'])

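But this code only works when there is a single emoji. When a message has two emojis, it prints only one: for example, for the message "🔥🖐" the output is "🔥" (only the first emoji). Here is the HTML of that message:
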
<div class="JwMbj i0jNr selectable-text copyable-text">
<span class="_3R6rC">
<img crossorigin="anonymous"
src="/img/d07f9aca6938f691b840f97dd1cd67dd_w_1749-40.png" alt="🔥" draggable="false"
class="_2UdhN _3zyju i0jNr selectable-text copyable-text" data-plain-text="🔥"
style="visibility: visible;">
</span>
<span class="_3R6rC">
<img crossorigin="anonymous"
src="/img/d07f9aca6938f691b840f97dd1cd67dd_w_1845-40.png" alt="🖐" draggable="false"
class="_2UdhN _3zyju i0jNr selectable-text copyable-text" data-plain-text="🖐"
style="visibility: visible;">
</span>
</div>

I tried this code to print both emojis, but it doesn't work:

msglist = []
m = s.find_all('div', attrs={'class': 'i0jNr'})
for b in m:
    v = b.find_all('div', attrs={'class': 'JwMbj'})
    for x in v:
        z = x.find_all('span', attrs={'class': '_3R6rC'})
        for i in z.children:
            if isinstance(i, NavigableString):
                print(i)
            elif isinstance(i, Tag):
                print(i.attrs['alt'])

But there is no output.

You can convert the <img> tags to plain text and then get the text normally with .get_text(). For example:

from bs4 import BeautifulSoup
html_doc = """
<div class="JwMbj i0jNr selectable-text copyable-text">
<span class="_3R6rC">
<img crossorigin="anonymous"
src="/img/d07f9aca6938f691b840f97dd1cd67dd_w_1749-40.png" alt="🔥" draggable="false"
class="_2UdhN _3zyju i0jNr selectable-text copyable-text" data-plain-text="🔥"
style="visibility: visible;">
</span>
<span class="_3R6rC">
<img crossorigin="anonymous"
src="/img/d07f9aca6938f691b840f97dd1cd67dd_w_1845-40.png" alt="🖐" draggable="false"
class="_2UdhN _3zyju i0jNr selectable-text copyable-text" data-plain-text="🖐"
style="visibility: visible;">
</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# select the main text div
text_div = soup.select_one(".copyable-text")
# convert all <img> to plain-text:
for img in text_div.select("img[data-plain-text]"):
    img.replace_with(img["data-plain-text"])
# get text normally:
print(text_div.get_text(strip=True))

Prints:

🔥🖐
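
If the page holds several messages, or a message mixes text and emojis, the same idea can be wrapped in a small helper that reads the tree without modifying it. This is only a sketch: `message_text` is a hypothetical helper name, the sample HTML is trimmed down from the markup in the question, and it assumes every message lives in a `div` with the `copyable-text` class, which may change whenever WhatsApp regenerates its class names.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def message_text(container: Tag) -> str:
    """Hypothetical helper: text of one message with emoji images inlined."""
    parts = []
    for node in container.descendants:
        if isinstance(node, NavigableString):
            # Skip whitespace-only strings that come from HTML formatting.
            if node.strip():
                parts.append(str(node))
        elif isinstance(node, Tag) and node.name == "img":
            # Prefer data-plain-text and fall back to alt.
            parts.append(node.get("data-plain-text") or node.get("alt", ""))
    return "".join(parts)


# Trimmed-down sample (assumption): two messages, the first mixing text and emojis.
html_doc = """
<div class="JwMbj i0jNr selectable-text copyable-text">
<span class="_3R6rC">hello <img alt="🔥" data-plain-text="🔥" class="_2UdhN"></span>
<span class="_3R6rC"><img alt="🖐" data-plain-text="🖐" class="_2UdhN"></span>
</div>
<div class="JwMbj i0jNr selectable-text copyable-text">
<span class="_3R6rC"><img alt="🎂" data-plain-text="🎂" class="_2UdhN"></span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# One string per message container; restricting to div avoids matching the <img> tags.
messages = [message_text(div) for div in soup.select("div.copyable-text")]
print(messages)
```

Walking `.descendants` rather than `.children` is what picks up emojis from every `span` inside the message, not just the first one.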
