Scraping paginated product data to get the details of all products



I want to scrape all the product data for the 'cushion-cover' category at URL = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'. The data I'm after is in a script tag, but how do I get the data from all the pages? I need all the product pages for that URL, and the data for the other pages also comes from an API: API = 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above we get that page's data, but how do I fetch the data from all the different pages? Please advise.
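If it helps, the paginated API URLs can be generated with `urllib.parse.urlencode`, which percent-encodes the `sort[by]`/`sort[dir]` brackets the same way they appear in the API URL above. A minimal sketch (the build id `B60DhzfamQWEpEl9Q8ajE` is taken from that URL and changes whenever the site is redeployed):

```python
from urllib.parse import urlencode

# Base of the Next.js data endpoint seen in the question; the build id
# segment (B60DhzfamQWEpEl9Q8ajE) is not stable across deployments
BASE = ('https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE'
        '/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json')

def page_url(page, limit=50):
    # urlencode percent-encodes the brackets: sort[by] -> sort%5Bby%5D
    params = {'limit': limit, 'page': page,
              'sort[by]': 'popularity', 'sort[dir]': 'desc'}
    return BASE + '?' + urlencode(params)
```

Each page of results then only differs in the `page` query parameter.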

import requests
import pandas as pd
import json
import csv
from lxml import html
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
prodResphtml = html.fromstring(prodresp.text)
print(prodresp)

partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
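Once extracted, that script text is plain JSON and can be loaded with `json.loads`. A sketch against a stub payload (the `nbPages` key is an assumption on my part; inspect the real payload to confirm which field holds the page count):

```python
import json

# Stub standing in for the text of the __NEXT_DATA__ script tag; the real
# payload nests the catalog under props -> pageProps -> props
raw = ('{"props": {"pageProps": {"props": {"catalog": '
       '{"nbPages": 192, "hits": [{"url": "graphic-geometric-pattern"}]}}}}}')

payload = json.loads(raw)
catalog = payload['props']['pageProps']['props']['catalog']
total_pages = catalog.get('nbPages')  # hypothetical key name, check the real JSON
links = ['https://www.noon.com/' + hit['url'] for hit in catalog['hits']]
```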

You are almost there. You can paginate through all the pages with a for loop and the range function, since we know the total page count is 192; that is why I paginate in this robust way. To collect all the product URLs (or any other data item) from every page, you can follow the example below.

Script:

import requests
import pandas as pd
import json
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/?limit=50&page={page}&sort[by]=popularity&sort[dir]=desc'
data = []
for page in range(0, 192):
    prodresp = requests.get(produrl.format(page=page), headers=headers, timeout=30)
    prodResphtml = html.fromstring(prodresp.text)
    partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
    partjson = json.loads(partjson[0])
    for item in partjson['props']['pageProps']['props']['catalog']['hits']:
        link = 'https://www.noon.com/' + item['url']
        data.append(link)

df = pd.DataFrame(data, columns=['URL'])
# df.to_csv('product.csv', index=False)  # to save the data to your system
print(df)

Output:

URL
0     https://www.noon.com/graphic-geometric-pattern...
1     https://www.noon.com/classic-nordic-decorative...
2     https://www.noon.com/embroidered-iconic-medusa...
3     https://www.noon.com/geometric-marble-texture-...
4     https://www.noon.com/traditional-damask-motif-...
...                                                 ...
9594  https://www.noon.com/geometric-printed-cushion...
9595  https://www.noon.com/chinese-style-art-printed...
9596  https://www.noon.com/chinese-style-art-printed...
9597  https://www.noon.com/chinese-style-art-printed...
9598  https://www.noon.com/chinese-style-art-printed...
[9599 rows x 1 columns]
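When hitting 192 pages in a row it is also worth reusing a single `requests.Session` (one underlying connection) and pausing between requests. A hypothetical variation on the loop above (`fetch_all_pages` and the delay value are my own, not part of the original script):

```python
import time
import requests

def fetch_all_pages(url_template, headers, last_page=192, delay=1.0):
    """Yield the HTML of each listing page, reusing one connection."""
    with requests.Session() as session:
        session.headers.update(headers)
        for page in range(1, last_page + 1):
            resp = session.get(url_template.format(page=page), timeout=30)
            resp.raise_for_status()  # fail fast if a page is blocked or missing
            yield resp.text
            time.sleep(delay)  # small pause between requests to stay polite
```

Because it is a generator, each page is fetched lazily as you iterate, so parsing can start before the full crawl finishes.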

I used the re library. In other words, I used regular expressions, which are a better fit for scraping pages that embed their data in JavaScript.

import requests
import pandas as pd
import json
import csv
from lxml import html
import re
headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}

url = "https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/"
prodresp = requests.get(url, headers=headers, timeout=30)
jsonpage = re.findall(r'type="application/json">(.*?)</script>', prodresp.text)
jsonpage = json.loads(jsonpage[0])
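The regex can be sanity-checked against a minimal stub of the page, which keeps the check independent of the live site (the stub HTML below is invented; the real page embeds one large single-line JSON blob in the same kind of script tag):

```python
import json
import re

# Invented stand-in for prodresp.text; note that without re.DOTALL the
# non-greedy (.*?) stops at newlines, which is fine for a single-line blob
html_text = ('<script id="__NEXT_DATA__" type="application/json">'
             '{"props": {"pageProps": {}}}</script>')

jsonpage = re.findall(r'type="application/json">(.*?)</script>', html_text)
data = json.loads(jsonpage[0])
```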
