Extracting comment counts from LinkedIn posts in Python using Selenium & BeautifulSoup



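I am writing a Python script that extracts performance data from my own personal LinkedIn profile by web scraping with Selenium and BeautifulSoup.

I can access my profile through Chrome and extract some of the data, but the comment counts are proving tricky.

This is what I have written so far: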

postComments = []
src = browser.page_source
# BeautifulSoup instance:
soup = BeautifulSoup(src, features="lxml")
bs4TagsComments = soup.find_all("li", attrs = {"class" : "social-details-social counts__item social-details-social-counts__comments"})
for tag in bs4TagsComments:
    strtag = str(tag)
    list_of_matches = re.findall('[,0-9]+', strtag)
    last_string = list_of_matches.pop()
    without_comma = last_string.replace(',', '')
    commentsCount = int(without_comma)
    postComments.append(commentsCount)
print(postComments)

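In theory the above should work, but all it prints is an empty list. There are comment counts to pull, and even if there were none I would expect to get at least a list of 0s.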

You can extract the comment values by matching the class with a regex. Try the following:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

# executable_path is the Selenium 3 form; Selenium 4 passes a Service object instead
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.implicitly_wait(10)
driver.maximize_window()
driver.get("https://www.linkedin.com/")
time.sleep(30)  # time to log in manually
soup = BeautifulSoup(driver.page_source, 'html5lib')
regex = re.compile('.*social-details-social-counts__comments.*')
# find all 'li' tags whose class contains `social-details-social-counts__comments`
comments = soup.find_all('li', {'class': regex})
for comment in comments:
    # strip newlines and spaces from the text
    value = comment.getText().replace('\n', '').replace(' ', '')
    print(value)

Output:

1comment
1comment
14comments
5comments
29comments
4comments
3comments
...
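
The values printed above are still strings such as 14comments, while the question wanted integers. A minimal sketch of that last step, assuming the text always starts its count with a digit and may use commas as thousands separators. The helper name parse_comment_count is hypothetical and not part of the original answer:

```python
import re


def parse_comment_count(text):
    """Return the integer count from text such as '14comments' or '1,234 comments'.

    Returns 0 when the text contains no digits (an assumption made for this sketch).
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return 0
    # Drop thousands separators before converting
    return int(match.group().replace(",", ""))


samples = ["1comment", "14comments", "1,234 comments", ""]
counts = [parse_comment_count(s) for s in samples]
print(counts)
```

Calling parse_comment_count(value) inside the loop and appending the result to a list would give the list of integers the question was after.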

To extract the comment count post by post:

regex = re.compile('.*social-details-social-counts__comments.*')
feeds = soup.find_all(...)  # placeholder: put the code that finds the feeds here
for feed in feeds:
    try:
        comments = feed.find('li', {'class': regex}).getText().replace('\n', '').replace(' ', '')
    except AttributeError:
        # find() returned None: this post has no comment counter
        comments = None
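
The per-post lookup can be tried offline against a static HTML string. The sketch below is only an illustration: the markup and the feed-item class are made up, and only the social-details-social-counts__comments class name comes from the answer above. It uses the built-in html.parser so that neither lxml nor html5lib is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the "feed-item" class and the nesting are invented for this example
html = """
<div class="feed-item">
  <ul>
    <li class="social-details-social-counts__item social-details-social-counts__comments">
      14 comments
    </li>
  </ul>
</div>
<div class="feed-item">
  <ul>
    <li class="social-details-social-counts__item">3 reactions</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
comment_class = re.compile("social-details-social-counts__comments")

results = []
for feed in soup.find_all("div", class_="feed-item"):
    # The regex is tested against each class of the tag, so extra classes do not matter
    tag = feed.find("li", class_=comment_class)
    if tag is None:
        results.append(None)  # this post has no comment counter
    else:
        results.append(tag.get_text(strip=True))
print(results)
```

Checking for None explicitly does the same job as the try/except above, and it keeps one entry per post so the results stay aligned with the feed list.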
