Scrapy - Splash: fetching dynamic data



I'm trying to scrape a dynamically loaded phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html

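The phone number appears after clicking the div element with the classes page-action click-tel. I'm trying to get this data with scrapy_splash, using a Lua script to perform the click.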

After pulling Splash on my Ubuntu machine:

sudo docker run -d -p 8050:8050 scrapinghub/splash

Here is my code so far (I'm using a proxy service):

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS": {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPLASH_URL": 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES": {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage'
    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url):
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"  ##forced
        self.item = company_item()
        self.script = """
            function main(splash)
                splash.private_mode_enabled = false
                assert(splash:go(splash.args.url))
                assert(splash:wait(0.5))
                local element = splash:select('.page-action.click-tel')
                local bounds = element:bounds()
                element:mouse_click{x=bounds.width/2, y=bounds.height/2}
                splash:wait(4)
                return splash:html()
            end
        """

    def start_requests(self):
        yield scrapy.Request(
            url = self.company_url,
            callback = self.parse,
            dont_filter = True,
            meta = {
                'splash': {
                    'endpoint': 'execute',
                    'url': self.company_url,
                    'args': {
                        'lua_source': self.script,
                        'proxy': 'http://usernamepassword@proxyhost:port',
                        'html': 1,
                        'iframes': 1
                    }
                }
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        print(soup.find('div',{'class','page-action click-tel'}))

The problem is that it has no effect: I still get nothing, as if the button had never been clicked.

Shouldn't return splash:html() put the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} into response.body (since element:mouse_click() waits for the changes to appear)?

Am I missing something?

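In most cases, when a site loads data dynamically, it does so by sending background XHR requests to the server. If you watch the network tab while clicking the "phone" button, you will see the browser send an XHR request to the URL https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can reproduce that request in your spider and avoid scrapy-splash entirely. See the sample implementation below, which uses a single URL: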

import scrapy
from urllib.parse import urlparse


class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # Obtain the uidsid and id needed to build the XHR request.
        # rsplit drops the ".html" extension; str.rstrip('.html') would strip
        # a set of characters rather than the suffix.
        uidsid = urlparse(response.url).path.split('/')[-1].rsplit('.', 1)[0]
        tel_id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(
            f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uidsid}&id={tel_id}",
            callback=self.parse_address,
        )

    def parse_address(self, response):
        # The endpoint returns JSON, e.g. {"digits": "..."}
        yield response.json()

I get this response:

{'digits': '+49 220 69 53 30'}
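
The URL construction used in the spider can also be pulled out into a small standalone helper, which makes it easy to check without running Scrapy. This is only a sketch: the function name build_telecom_url and the constant TELECOM_ENDPOINT are made up for illustration, and the shape of the onclick attribute (a handler called with event,'<id>',...) is an assumption based on the regular expression in the spider, not something confirmed against the live page.

```python
import re
from typing import Optional
from urllib.parse import urlencode, urlparse

# Endpoint seen in the browser's network tab, as quoted in the answer.
TELECOM_ENDPOINT = "https://www.europages.fr/InfosTelecomJson.json"


def build_telecom_url(page_url: str, onclick: str) -> Optional[str]:
    """Return the XHR URL for a company page, or None if no id is found.

    Hypothetical helper: `onclick` is assumed to contain "event,'<digits>',".
    """
    # Last path segment without its extension, e.g. "DEU241700-00101".
    filename = urlparse(page_url).path.rsplit("/", 1)[-1]
    uidsid = filename.rsplit(".", 1)[0]

    # Pull the numeric id out of the onclick handler text.
    match = re.search(r"event,'(\d+)',", onclick)
    if match is None:
        return None

    query = urlencode({"uidsid": uidsid, "id": match.group(1)})
    return f"{TELECOM_ENDPOINT}?{query}"


if __name__ == "__main__":
    page = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"
    # The handler name "showTel" is invented for this example.
    print(build_telecom_url(page, "showTel(event,'1330','tel')"))
```

Returning None when the id is missing lets the spider skip pages without a phone button instead of requesting a URL that ends in id=None.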
