似乎无法从网页中抓取特定信息?



我正试图为以下页面上显示的每个项目抓取一些信息:https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051& catalogId = 10051, langId = 1, categoryId = 1351370,各种= New + Spirits& categoryType = Spirits& top_category = 25208, sortBy = 0, searchSource = E&页面浏览人数=,beginIndex = 0 #方面:,productBeginIndex: 0, orderBy:和页面浏览人数:,minPrice:, maxPrice:,页大小:,

但是,我似乎无法访问项目信息。我需要的信息是每个产品的名称和链接,例如第一项包含在:

<a class="catalog_item_name" aria-hidden="true" tabindex="-1" id="WC_CatalogEntryDBThumbnailDisplayJSPF_3074457345616901168_link_9b" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0">Woodford Reserve Master Collection Five Malt Stouted Mash</a>

所以我要抓取的信息是:

Woodford Reserve Master Collection Five Malt Stouted Mash

/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0

我正在尝试对页面上的每个项目进行迭代。我肯定连接到页面,但由于某种原因,我不能使用for product in soup.select抓取任何信息下面是我的脚本的简化版本,我一直试图从上面的catalog_item_name

收集信息
import requests
import sys
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
from datetime import datetime
import json
import random
import requests
from itertools import cycle
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib3.exceptions import InsecureRequestWarning
from requests_html import HTMLSession
session = HTMLSession()

user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
for i in range(1,4):
#Pick a random user agent
user_agent = random.choice(user_agent_list)

url = []
url = 'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&'
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,features="html.parser")
link = []
for product in soup.select('a.catalog_item_name'):
link.append(product)
print(link)

任何帮助都将非常感激!

编辑:测试脚本与其他两个网站,它工作得很好。一定是网站的什么东西把它弄丢了?

我想这里最好的方法是检查网络流量并直接查询API。例如,对于上面的url,有一些POST请求针对https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CategoryProductsListingView的API。

我可以使用它来获取产品列表,例如:

from bs4 import BeautifulSoup
import requests
import urllib
base_url = 'https://www.finewineandgoodspirits.com'
path = '/webapp/wcs/stores/servlet/CategoryProductsListingView?sType=SimpleSearch&resultsPerPage=15&sortBy=0&disableProductCompare=false&ajaxStoreImageDir=%2fwcsstore%2fWineandSpirits%2f&variety=New+Spirits&categoryType=Spirits&ddkey=ProductListingView'
params = {
'storeId': '10051',
'categoryId': '1351370',
'searchType': '1002'
}
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'some super fancy browser',
}
request_url = base_url + path + '&' + urllib.parse.urlencode(params)
response = requests.post(request_url, headers=headers)
soup = BeautifulSoup(response.text)
# now, extract the content form the soup, eg like you did above
product_links: list[str] = [base_url + a['href'] for a in soup.select('a.catalog_item_name')]

最新更新