美丽汤按搜索类提取数据



下面的代码从下面的网站的评论中提取Arbeitsatmosphare和Stadt数据。但是提取是基于索引方法的,所以如果我们不想提取Arteitsatmosphare,而是提取图像(rating_tags[12](,我们将有错误,因为有时我们只有2或3个项目在审查。

我想更新此代码以得到以下输出。如果我们没有图像,请使用 0 或 n/a。

Arbeitsatmosphare | Stadt     | Image | 
1.      4.00            | Berlin    | 4.00  |
2.      5.00            | Frankfurt | 3.00  |
3.      3.00            | Munich    | 3.00  |
4.      5.00            | Berlin    | 2.00  |
5.      4.00            | Berlin    | 5.00  |

我的代码如下

import requests
from bs4 import BeautifulSoup
import pandas as  pd
arbeit = []
stadt = []
with requests.Session() as session:
session.headers = {
'x-requested-with': 'XMLHttpRequest'
}
page = 1
while True:
print(f"Processing page {page}..")
url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
articles = soup.find_all('article')
print("Number of articles: " + str(len(articles)))
for article in articles:
rating_tags = article.find_all('span', {'class' : 'rating-badge'})
arbeit.append(rating_tags[0].text.strip())

detail_div = article.find_all('div', {'class' : 'review-details'})[0]
nodes = detail_div.find_all('li')
stadt_node = nodes[1]
stadt_node_div = stadt_node.find_all('div')
stadt_name = stadt_node_div[1].text.strip()
stadt.append(stadt_name)
page += 1
pagination = soup.find_all('div', {'class' : 'paginationControl'})
if not pagination:
break
df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt})
print(df)

您可以使用try/except.

import requests
from bs4 import BeautifulSoup
import pandas as  pd
import re
arbeit = []
stadt = []
image = []
with requests.Session() as session:
session.headers = {
'x-requested-with': 'XMLHttpRequest'
}
page = 1
while True:
print(f"Processing page {page}..")
url = f'https://www.kununu.com/de/volkswagenconsulting/kommentare/{page}'
response = session.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
articles = soup.find_all('article')
print("Number of articles: " + str(len(articles)))
for article in articles:
rating_tags = article.find_all('span', {'class' : 'rating-badge'})
arbeit.append(rating_tags[0].text.strip())
try:
imageText = article.find('span', text=re.compile(r'Image')).find_next('span').text.strip()
image.append(imageText)
except:
image.append('N/A')

detail_div = article.find_all('div', {'class' : 'review-details'})[0]
nodes = detail_div.find_all('li')
stadt_node = nodes[1]
stadt_node_div = stadt_node.find_all('div')
stadt_name = stadt_node_div[1].text.strip()
stadt.append(stadt_name)
page += 1
pagination = soup.find_all('div', {'class' : 'paginationControl'})
if not pagination:
break
df = pd.DataFrame({'Arbeitsatmosphäre' : arbeit, 'Stadt' : stadt, 'Image': image})
print(df)

输出:

Processing page 1..
Number of articles: 10
Processing page 2..
Number of articles: 10
Processing page 3..
Number of articles: 10
Processing page 4..
Number of articles: 4
Arbeitsatmosphäre      Stadt Image
0               5,00  Wolfsburg  4,00
1               5,00  Wolfsburg  4,00
2               5,00  Wolfsburg  5,00
3               5,00  Wolfsburg  4,00
4               2,00  Wolfsburg  2,00
5               5,00  Wolfsburg  5,00
6               5,00  Wolfsburg  5,00
7               5,00  Wolfsburg  4,00
8               5,00  Wolfsburg  4,00
9               5,00  Wolfsburg  5,00
10              5,00  Wolfsburg  4,00
11              5,00  Wolfsburg  5,00
12              5,00  Wolfsburg  5,00
13              4,00  Wolfsburg  4,00
14              4,00  Wolfsburg  4,00
15              4,00  Wolfsburg  4,00
16              5,00  Wolfsburg  5,00
17              3,00  Wolfsburg  5,00
18              5,00  Wolfsburg  4,00
19              5,00  Wolfsburg  5,00
20              5,00  Wolfsburg  4,00
21              4,00  Wolfsburg  2,00
22              5,00  Wolfsburg  5,00
23              4,00  Wolfsburg   N/A
24              4,00  Wolfsburg  4,00
25              4,00  Wolfsburg  4,50
26              5,00  Wolfsburg  5,00
27              2,33  Wolfsburg  2,00
28              5,00  Wolfsburg  5,00
29              2,00  Wolfsburg  1,00
30              4,00  Wolfsburg  3,00
31              5,00  Wolfsburg  5,00
32              5,00  Wolfsburg  4,00
33              4,00  Wolfsburg  4,00

最新更新