Scrapy Crawl (referer: None)



I am new to Scrapy and Python. I am scraping data from Aliexpress.com using the Playwright method, and the log only shows (referer: None). Here is my code:

import scrapy
from scrapy_playwright.page import PageMethod


class AliSpider(scrapy.Spider):
    name = "aliex"

    def start_requests(self):
        # GET request
        search_value = 'phones'
        yield scrapy.Request(
            f"https://www.aliexpress.com/premium/{search_value}.html?spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageMethod('wait_for_selector', '.list--gallery--34TropR')
                ]
            ))

    async def parse(self, response):
        for data in response.xpath("//h1"):
            related_link = data.xpath(".//text()").get()
            yield {
                'related_link': related_link
            }

I am getting

2023-01-18 19:56:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aliexpress.com/wholesale?SearchText=phones&spm=a2g0o.productlist.1000002.0&initiative_id=SB_20230118063054&dida=y> (referer: None)
2023-01-18 19:56:55 [scrapy.core.engine] INFO: Closing spider (finished)

I tried both XPath and CSS selectors, but the result is the same. Can anyone help me?

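Here is a complete working solution using standalone Playwright with Python; it works fine on Windows. The site loads its data dynamically via JavaScript, which is why I use the page.evaluate() method to execute JavaScript and scroll through the whole page. Otherwise, it will not scrape the complete result set.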

Script:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd
import time

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    search_value = 'phones'
    # Visit the first three result pages
    for page_num in range(1, 4):
        page.goto(f"https://www.aliexpress.com/wholesale?SearchText={search_value}&catId=0&dida=y&g=y&initiative_id=SB_20230118063054&page={page_num}&spm=a2g0o.productlist.1000002.0&trafficChannel=main")
        page.wait_for_selector('[class="manhattan--content--1KpBbUi"]', timeout=30000)
        # Measure the document height once, right after the first cards appear
        scroll_height = page.evaluate("""() => {
            return Math.max(
                document.body.scrollHeight, document.documentElement.scrollHeight,
                document.body.offsetHeight, document.documentElement.offsetHeight,
                document.body.clientHeight, document.documentElement.clientHeight
            );
        }""")
        current_height = 0
        # Scroll one viewport at a time so the lazily loaded cards get rendered
        while current_height < scroll_height:
            current_height = page.evaluate("""() => {
                window.scrollBy(0, window.innerHeight);
                return window.scrollY;
            }""")
            time.sleep(2)
        # Parse the fully rendered page and collect the product titles
        soup = BeautifulSoup(page.content(), 'lxml')
        for card in soup.select('[class="manhattan--content--1KpBbUi"]'):
            title = card.h1.text
            data.append({'title': title})

df = pd.DataFrame(data)
print(df)

Output:

title
0    Unlock Samsung Galaxy S10 S10+ s10e G970U G973...
1    SERVO K07 Plus mini Mobile Phone Pen Dual SIM ...
2    BLACKVIEW OSCAL C80 Smartphone 6.5" Waterdrop ...
3    Original Apple iPhone 7 Unlocked 99% New Mobil...
4    [World Premiere] Blackview BV9200 Rugged Smart...
..                                                 ...
175  Motorola StarTAC Rainbow 500mAh Fashion 90% Ne...
176  Original International Version HuaWei P30 Pro ...
177  Unlocked Original Apple iPhone SE Dual Core 2G...
178  2022 Unihertz TANK Large Battery Rugged Smartp...
179  75W Car Wireless Charger Car Mount Phone Holde...
[180 rows x 1 columns]
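
The scrolling step in the script compares the scroll position against a height measured once at the start, so it depends on the page growing while it scrolls. As a minimal sketch of one alternative (not the approach used in the script), the loop could be factored into a helper that re-checks the live document height after every step. The names scroll_to_bottom, SCROLL_JS, AT_BOTTOM_JS, pause and max_steps are made up for this illustration; the only Playwright call assumed is page.evaluate().

```python
import time

# JavaScript run in the page: scroll down by one viewport height.
SCROLL_JS = "() => window.scrollBy(0, window.innerHeight)"

# JavaScript run in the page: is the bottom of the document visible now?
# The 1px tolerance allows for fractional scroll positions.
AT_BOTTOM_JS = """() =>
    window.scrollY + window.innerHeight
        >= document.documentElement.scrollHeight - 1"""


def scroll_to_bottom(page, pause=2.0, max_steps=50, sleep=time.sleep):
    """Scroll `page` one viewport at a time and return the number of steps.

    `page` only needs an `evaluate(js)` method, as Playwright's sync Page has.
    `max_steps` is an assumed safety cap so the loop always terminates.
    """
    for step in range(1, max_steps + 1):
        page.evaluate(SCROLL_JS)
        sleep(pause)  # give lazily loaded cards time to render
        # Checked after the pause, so content that loaded meanwhile is counted
        if page.evaluate(AT_BOTTOM_JS):
            return step
    return max_steps
```

In the script above, a call such as scroll_to_bottom(page) placed right after page.wait_for_selector(...) would stand in for the scroll_height measurement and the while loop. The cap of 50 steps and the 2 second pause are arbitrary starting values, not tuned for AliExpress.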
