Python 3 script skips pages when scraping a website with BeautifulSoup



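I'm trying to scrape Glassdoor reviews of Microsoft using Python 3 and BeautifulSoup. The code works as expected, at least partly, but it randomly skips some pages and I don't know why. My code looks like this: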

import requests
from bs4 import BeautifulSoup
import time
import csv
# Set a counter
i=1
# specify the URL of the website you want to scrape
url = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P"+str(i)+".htm?filter.iso3Language=eng"
while True:
    i = i+1
    page = requests.get(url)
    # if page.status_code != 200:
    #   break
    url = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P"+str(i)+".htm?filter.iso3Language=eng"
    # make a GET request to the website and retrieve the HTML content
    response = requests.get(url)
    time.sleep(0.5)
    html = response.content

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")

    for category, match in zip(soup.find_all("p", class_="mb-0 strong"),
                               soup.find_all("p", class_="mt-0 mb-0 pb v2__EIReviewDetailsV2__bodyColor v2__EIReviewDetailsV2__lineHeightLarge v2__EIReviewDetailsV2__isExpanded")):
        reviews = match.span.text
        proscons = category.text
        print(proscons)
        print(reviews)
        print(i)
        print()
    if i>10:
        break

The output looks like this:

Pros
Respect for employee needs, holidays generally calm and time off respected
3
[Skipped page 2, but all is as expected until page 4]
Cons
The Tech stack is narrow. Limited career opportunities.
4
Pros
The culture is VERY good
7
[Pages 5 and 6 were also skipped]

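The behavior seems to be completely random: when I rerun the same code, different pages are parsed and others are skipped.

Thanks in advance for your help!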

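It's not entirely clear to me which data values you want to scrape, but you can try the following example and see whether it meets your expectations.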

Code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Send a browser-like User-Agent header with every request
headers = {'User-Agent':'Mozilla/5.0'}
data = []
for page in range(1,51):
    res = requests.get(f"https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P{page}.htm?filter.iso3Language=eng", headers = headers)
    #print(res)
    soup = BeautifulSoup(res.content, "html.parser")
    # Each review sits in its own div.gdReview container
    for review in soup.select('div.gdReview'):

        data.append({
            "review_title": review.select_one('h2[class="mb-xxsm mt-0 css-93svrw el6ke055"] > a').get_text(strip=True),
            "pros": review.select_one('span[data-test="pros"]').text,
            'cons': review.select_one('span[data-test="cons"]').text,
            "review_decision": review.select_one('div[class="common__EiReviewDetailsStyle__socialHelpfulcontainer pt-std"]').text
        })

# to_csv() returns None when given a path, so keep the DataFrame in its own variable
df = pd.DataFrame(data)
df.to_csv('out.csv', index=False)
print(df)

Output:

review_title  ...                           review_decision
0    Great company to work with to grow your archit...  ...  Be the first to find this review helpful
1                          Thoughts after 10 years....  ...     2172 people found this review helpful
2                                        Great company  ...  Be the first to find this review helpful
3                                        Great company  ...  Be the first to find this review helpful
4                          Fair employment environment  ...  Be the first to find this review helpful
..                                                 ...  ...                                       ...
495                                  Microsoft reviews  ...  Be the first to find this review helpful
496  Good place to coast; annoying place for engine...  ...  Be the first to find this review helpful
497                                            Not bad  ...  Be the first to find this review helpful
498                        Great Company/Great Culture  ...  Be the first to find this review helpful
499                                   Liked everything  ...  Be the first to find this review helpful
[500 rows x 4 columns]
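
One possible explanation for the seemingly random gaps, offered as an assumption rather than a confirmed diagnosis: the question's loop never checks `status_code` (the check is commented out), so any response that is not a normal review page produces zero matches from `find_all`, and that page silently prints nothing. The sketch below makes such failures visible by retrying and returning `None` for pages that never come back with status 200. `page_url`, `fetch_html`, and the `get` parameter are hypothetical names introduced for illustration; `get` can be any callable that returns an object with `status_code` and `content`, for example a small wrapper around `requests.get`.

```python
import time

# Same URL pattern as in the question, with the page number as a placeholder.
BASE = "https://www.glassdoor.com/Reviews/Microsoft-Reviews-E1651_P{page}.htm?filter.iso3Language=eng"


def page_url(page):
    """Build the review URL for a 1-based page number."""
    return BASE.format(page=page)


def fetch_html(url, get, retries=3, delay=1.0, sleep=time.sleep):
    """Return the body of the first 200 response, or None if every attempt fails.

    `get` is a hypothetical injection point: any callable that takes a URL and
    returns an object with `status_code` and `content` attributes.
    """
    for attempt in range(1, retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response.content
        # Wait a little longer after each failed attempt, except the last one.
        if attempt < retries:
            sleep(delay * attempt)
    return None
```

With requests this could be called as `fetch_html(page_url(p), lambda u: requests.get(u, headers=headers))`. A `None` result then marks a page worth logging, instead of one that disappears from the output without a trace.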
