Web scraping problem: the first 2 pages open, the rest do not



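I've been trying to write a data scraper for a web shop that sells cables and other items. I wrote some simple code that should work. The shop's products are organized by category, and I picked the first category, cables (Kable i przewody).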

for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)

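For the first two pages, i = 0 and i = 1, it works fine (I get response code 200), but every other page (2 and up) returns error 500 no matter when I try. I don't know why, especially since the same links open normally when I visit them by hand. I even tried randomizing the time between requests :( Any idea what the problem might be? Should I try a different web scraping library? Here is the full code: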

import requests
from fake_useragent import UserAgent
import pandas as pd
from bs4 import BeautifulSoup
import time
import random

products = []  # List to store name of the product
MIN = []  # Manufacturer item number
prices = []  # List to store price of the product
df = pd.DataFrame()
user_agent = UserAgent()
i = 0
for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)
    #print(url)
    # getting the response from the page using get method of requests module
    page = requests.get(url, headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"})
    #print(page.status_code)
    # storing the content of the page in a variable
    html = page.content
    # creating BeautifulSoup object
    page_soup = BeautifulSoup(html, "html.parser")
    #print(page_soup.prettify())
    for containers in page_soup.findAll('div', {'class': 'styles__ProductsListItem-vrexg1-2 gkrzX'}):
        name = containers.find('label', attrs={'class': 'styles__Label-sc-1x6v2mz-2 gmFpMA label'})
        price = containers.find('span', attrs={'class': 'styles__PriceValue-sc-33rfvt-10 fVFAzY'})
        man_it_num = containers.find('div', attrs={'title': 'Indeks producenta'})
        formatted_name = name.text.replace('Dodaj do koszyka: ', '')
        products.append(formatted_name)
        prices.append(price.text)
        MIN.append(man_it_num.text)
    df = pd.DataFrame({'Product Name': products, 'Price': prices, 'MIN': MIN})
    time.sleep(random.randint(2, 11))
#df.to_excel('output.xlsx', sheet_name='Kable i przewody')

That's because the pages are loaded dynamically through an API. So, to get all the data, you have to use the API.

Example:

import pandas as pd
import requests

# JSON endpoint that the category page calls; {p} is the page number
api_url = 'https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{p}'
headers = {
    'user-agent': 'Mozilla/5.0',
    'referer': 'https://onninen.pl/produkty/Kable-i-przewody?query=/strona:2',
    'cookie': '_gid=GA1.2.1022119173.1663690794; _fuid=60a315c76d054fd5add850c7533f529e; _gcl_au=1.1.1522602410.1663690804; pollsvisible=[]; smuuid=1835bb31183-22686567c511-4116ddce-c55aa071-2639dbd6-ec19e64a550c; _smvs=DIRECT; poll_random_44=1; poll_visited_pages=2; _ga=GA1.2.1956280663.1663690794; smvr=eyJ2aXNpdHMiOjEsInZpZXdzIjo3LCJ0cyI6MTY2MzY5MjU2NTI0NiwibnVtYmVyT2ZSZWplY3Rpb25CdXR0b25DbGljayI6MCwiaXNOZXdTZXNzaW9uIjpmYWxzZX0=; _ga_JXR5QZ2XSJ=GS1.1.1663690794.1.1.1663692567.0.0.0'
}

dfs = []
for p in range(1, 28):
    # The product records sit under items[0].items in the JSON response
    d = requests.get(api_url.format(p=p), headers=headers).json()['items'][0]['items']
    df = pd.DataFrame(d)
    dfs.append(df)

# Combine the per-page frames into a single DataFrame
df = pd.concat(dfs)
print(df)

Output:

id                                               slug   index    catalogindex  ... onntopcb  isnew    qc   ads
0   147774  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES890  112271067D0500  ...        0  False  None  None
1    45315  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES893  112271068D0500  ...        0  False  None  None
2   169497  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES896  112271069D0500  ...        0  False  None  None
3   141820  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x...  HES900  112271056D0500  ...        0  False  None  None
4    47909  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x...  HES903  112271064D0500  ...        0  False  None  None
..     ...                                                ...     ...             ...  ...      ...    ...   ...   ...
37  111419  NVENT-RAYCHEM-Kabel-grzejny-EM2-XR-samoreguluj...  HDZ938      449561-000  ...        0   True  None  None
38  176526  NVENT-RAYCHEM-Przewod-stalooporowy-GM-2CW-35m-...  HEA099      SZ18300102  ...        0  False  None  None
39   38484  DEVI-Mata-grzewcza-DEVIheat-150S-150W-m2-375W-...  HAJ162        140F0332  ...        1  False  None  None
40   60982  DEVI-Mata-grzewcza-DEVImat-150T-150W-m2-375W-0...  HAJ157        140F0448  ...        1  False  None  None
41  145612  DEVI-Czujnik-Devireg-850-rynnowy-czujnik-140F1...  HAJ212        140F1086  ...        0  False  None  None
[1292 rows x 27 columns]
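
If you don't want the loop to crash when a page comes back with an unexpected shape, the extraction step can be pulled out into a small helper. This is only a sketch: `extract_products` and `collect_pages` are hypothetical names, the `items[0].items` layout is taken from the example above, and stopping at the first empty page is my assumption about how the API signals the end of the listing. The HTTP call is passed in as `fetch_json` so the logic can be tried without hitting the site.

```python
from typing import Any, Callable, Dict, Iterable, List

# Same endpoint as in the example above; {p} is the page number.
API_URL = "https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{p}"


def extract_products(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records under items[0].items, or [] if the shape differs."""
    groups = payload.get("items") or []
    if not groups or not isinstance(groups[0], dict):
        return []
    return list(groups[0].get("items") or [])


def collect_pages(
    fetch_json: Callable[[str], Dict[str, Any]],
    pages: Iterable[int],
    url_template: str = API_URL,
) -> List[Dict[str, Any]]:
    """Fetch each page via fetch_json; stop at the first page with no products."""
    rows: List[Dict[str, Any]] = []
    for p in pages:
        products = extract_products(fetch_json(url_template.format(p=p)))
        if not products:
            # Assumption: an empty page means there is nothing further to fetch.
            break
        rows.extend(products)
    return rows
```

With requests, `fetch_json` could be something like `lambda u: requests.get(u, headers=headers).json()`, and the returned list can be passed straight to `pd.DataFrame(rows)`.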
