How to prevent duplicates when scraping reviews



I'm scraping reviews from Glassdoor.com, and every time I run the code I get duplicated pages. For example, the first page is scraped twice.

If you check the counts the code prints to the terminal (shown below), you'll notice that page 2 and page 3 both show 20. That means the reviews on page 3 were skipped and, consequently, page 2 is duplicated.

Is there any solution for this?

import requests
from bs4 import BeautifulSoup
import pandas as pd  # we will need a dataset

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.80 Safari/537.36'}
ReviewsList = []

def extract(pg):
    url = f'https://www.glassdoor.com/Reviews/Meta-Information-Technology-Reviews-EI_IE40772.0,4_DEPT1011_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')  # parse the whole HTML page
    # get the reviews
    divs = soup.find_all('div', class_='gdReview')
    try:
        for item in divs:
            Title = item.find('h2', class_='mb-xxsm mt-0 css-93svrw el6ke055').text.strip()
            Rating = item.find('span', class_='ratingNumber mr-xsm').text.strip()
            Employee_Situation = item.find('span', class_='pt-xsm pt-md-0 css-1qxtz39 eg4psks0').text.strip()
            Pros = item.find('span', {'data-test': 'pros'}).text.strip()
            Cons = item.find('span', {'data-test': 'cons'}).text.strip()
            Author_Info = item.find('span', class_='common__EiReviewDetailsStyle__newUiJobLine').text.strip()

            Reviews = {
                'Title': Title,
                'Rating': Rating,
                'Employee_Situation': Employee_Situation,
                'Pros': Pros,
                'Cons': Cons,
                'Author_Info': Author_Info,
            }
            ReviewsList.append(Reviews)
        return
    except:
        pass

# loop over the pages
for i in range(1, 10, 1):
    soup = extract(f'https://www.glassdoor.com/Reviews/Meta-Information-Technology-Reviews-EI_IE40772.0,4_DEPT1011_IP{i}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME')
    print(f' page {i}')
    extract(soup)
    print(len(ReviewsList))

df = pd.DataFrame(ReviewsList)
df.to_csv('GlassdoorReviews2.csv')
print(len(ReviewsList))

This is what is printed to the terminal:

page 1
10
page 2
20
page 3
20
page 4
30
page 5
50
page 6
70
page 7
80
page 8
80
page 9
100
100
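One likely cause visible in the loop above: `extract` is called twice per iteration, first with the whole URL (which then gets interpolated into the URL template again inside `extract`) and then a second time with its `None` return value, so pages can be fetched twice or resolve to the wrong page. Below is a minimal sketch of a loop that calls `extract` exactly once per page with just the page number; the network call is replaced by a stub (`fetched_pages` and the fake reviews are illustrative, not part of the original code), so only the control flow is demonstrated:

```python
ReviewsList = []
fetched_pages = []

def extract(pg):
    # In the real code this builds the URL with IP{pg} and parses the page;
    # here we only record which page was requested and append fake reviews.
    fetched_pages.append(pg)
    for n in range(10):  # pretend each page holds 10 reviews
        ReviewsList.append({'Title': f'review {pg}-{n}'})

for i in range(1, 10):
    extract(i)  # a single call per page -- no second extract(soup) call
    print(f' page {i}')
    print(len(ReviewsList))
```

With one call per page, each page number is requested exactly once and the running count grows by a constant amount per page.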

I ran into the same problem before and solved it easily by using a set, which keeps only unique values.
Here is an example of a set:

myset = {"apple", "banana", "cherry"}  # strings used as an example; any hashable object works
myset.add("apple")  # adding a duplicate value has no effect
print(myset)

This is the output:

{'apple', 'cherry', 'banana'}
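The same set idea can be applied to the review records, but a dict like `Reviews` is not hashable and cannot be added to a set directly. One workaround, sketched here with the field names assumed from the code above, is to keep a separate `seen` set of tuples as deduplication keys:

```python
ReviewsList = []
seen = set()

def add_review(review):
    # Use an immutable tuple of the fields as the dedup key,
    # since dicts themselves cannot be stored in a set.
    key = (review['Title'], review['Rating'], review['Pros'], review['Cons'])
    if key not in seen:
        seen.add(key)
        ReviewsList.append(review)

add_review({'Title': 'Great place', 'Rating': '5.0', 'Pros': 'pay', 'Cons': 'hours'})
add_review({'Title': 'Great place', 'Rating': '5.0', 'Pros': 'pay', 'Cons': 'hours'})  # duplicate, skipped
print(len(ReviewsList))  # prints 1
```

Calling a helper like this instead of `ReviewsList.append(Reviews)` would drop repeated reviews even if a page happens to be scraped twice.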
