Scrapy is not scraping the whole page, only part of it



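I am trying to scrape job postings from BP's website with Scrapy. Initially robots.txt did not allow the crawl, but after I set ROBOTSTXT_OBEY = False it started working. Now, however, it does not scrape the whole page. Here is my code: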

import scrapy


class exxonmobilSpider(scrapy.Spider):
    name = "bp"
    start_urls = ["https://www.bp.com/en/global/corporate/careers/search-and-apply.html?query=data+"]

    def parse(self, response):
        name = response.xpath('//h3[@class="Hit_hitTitle__3MFk3"]')
        print(name)
        print(len(name))

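As you can see in the screenshot, the XPath matches the h3 tag in the browser, but when I run the code I get an empty list. I then printed all the li or div tags and counted them as a cross-check, and found that only about half of the tags had been scraped. Does anyone know why Scrapy fetches only part of the page instead of all of it? Comparison screenshots are attached: the browser shows 55 li tags in total, but the length of the variable "name" taken from the response is 0.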

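Hopefully the OP will include a minimal reproducible example in the next question. Here is one way to get those jobs. Bear in mind that the jobs are pulled from an API by JavaScript in the page, so you either need to use Splash or scrapy-playwright, or scrape the API directly. We will do the latter. The API URL is taken from the browser's Dev Tools, Network tab.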

import scrapy


class BpscrapeSpider(scrapy.Spider):
    name = 'bpscrape'
    allowed_domains = ['algolianet.com', 'bp.com']

    def start_requests(self):
        # Headers, URL and payload are copied from the request seen in the Network tab.
        headers = {
            'x-algolia-application-id': 'RF87OIMXXP',
            'x-algolia-api-key': 'f4f167340049feccfcf6141fb7b90a5d',
            'Origin': 'https://www.bp.com',
            'content-type': 'application/x-www-form-urlencoded',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }
        api_url = 'https://rf87oimxxp-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.9.1)%3B%20Browser%3B%20JS%20Helper%20(3.4.4)%3B%20react%20(17.0.2)%3B%20react-instantsearch%20(6.11.0)'
        payload = '{"requests":[{"indexName":"candidatematcher_bp_navapp_prod","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&filters=type%3A%20Professionals&hitsPerPage=100&query=data%20scientist&maxValuesPerFacet=20&page=0&facets=%5B%22country%22%2C%22group%22%5D&tagFilters="}]}'
        yield scrapy.Request(
            url=api_url,
            headers=headers,
            body=payload,
            callback=self.parse,
            method="POST")

    def parse(self, response):
        # Each hit is one job posting.
        data = response.json()['results'][0]['hits']
        for x in data:
            yield x

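Run it with scrapy crawl bpscrape -o bpdsjobs.json to get a JSON file containing all 26 jobs. You will need to do some data cleaning, because the JSON response is very comprehensive and contains many HTML tags, among other things.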
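
The cleaning step mentioned above could look roughly like the following. This is only a sketch using the standard library: the field names `title` and `description` are hypothetical, so inspect a real hit to see which keys the API actually returns.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return `value` with HTML tags removed and whitespace collapsed."""
    stripper = _TagStripper()
    stripper.feed(value)
    stripper.close()
    return " ".join("".join(stripper.parts).split())


def clean_hit(hit, keep=("title", "description")):
    """Keep only the selected string fields of a hit, stripped of markup.

    The keys in `keep` are placeholders, not confirmed field names.
    """
    return {k: strip_html(hit[k]) for k in keep if isinstance(hit.get(k), str)}
```

In the spider, `yield clean_hit(x)` in place of `yield x` would write the reduced records to the output file.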

For the Scrapy documentation, see https://docs.scrapy.org/en/latest/
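
As a side note, the hand-encoded payload in the spider above can also be assembled programmatically, which makes it easier to change the search term or the page number. The sketch below is an assumption-laden simplification: `build_payload` is a hypothetical helper, and it leaves out the highlight tags, `maxValuesPerFacet` and `tagFilters` parameters on the guess that the API does not require them.

```python
import json
from urllib.parse import quote, urlencode


def build_payload(query, page=0, hits_per_page=100):
    """Build the JSON request body for the Algolia multi-query endpoint.

    The index name and filter mirror the spider above; `query`, `page`
    and `hits_per_page` are the parts most likely to change.
    """
    params = {
        "filters": "type: Professionals",
        "hitsPerPage": hits_per_page,
        "query": query,
        "page": page,
        # Compact JSON, matching the ["country","group"] list in the original payload
        "facets": json.dumps(["country", "group"], separators=(",", ":")),
    }
    request = {
        "indexName": "candidatematcher_bp_navapp_prod",
        # quote_via=quote encodes spaces as %20 rather than "+"
        "params": urlencode(params, quote_via=quote),
    }
    return json.dumps({"requests": [request]})
```

Passing `body=build_payload("data scientist")` to `scrapy.Request` should then be roughly equivalent to the literal string used in the spider.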

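Thank you very much for your code. I am trying to do the same thing for Sotheby's. They are also using Algolia. I am using the following code, and I get the following error: SyntaxError: invalid syntax on line 19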

name = 'sothebys'
allowed_domains = ['algolianet.com', 'sothebys.com']

def start_requests(self):
    headers = {
        #'x-algolia-application-id': 'KAR1UEUPJD',
        #'x-algolia-api-key': 'ZGYwMDE4ZmE1MzhlNjQ0NDQ4NzA0MDQ4MWY2YjZhYzNlNDg4OWIzMmQ2YmE1NjdmMWYyYTQ1YzBkNGM1YzdlNnZhbGlkVW50aWw9MTY4MTE1OTM1NyZyZXN0cmljdEluZGljZXM9YXR0cmlidXRlX2ZpeGVkX3ZhbHVlcyUyQ3Byb2RfYXR0cmlidXRlX2ZpeGVkX3ZhbHVlcyUyQ3Byb2RfYXR0cmlidXRlX2ZpeGVkX3ZhbHVlc18qJTJDYXVjdGlvbnMlMkNwcm9kX2F1Y3Rpb25zJTJDcHJvZF9hdWN0aW9uc18qJTJDcHJvZF9hdWN0aW9uc19uYW1lX2FzYyUyQ3Byb2RfYXVjdGlvbnNfbmFtZV9kZXNjJTJDcHJvZF9hdWN0aW9uc19zdGFydERhdGVfYXNjJTJDcHJvZF9hdWN0aW9uc19zdGFydERhdGVfZGVzYyUyQ3Byb2RfYXVjdGlvbnNfZW5kRGF0ZV9hc2MlMkNwcm9kX2F1Y3Rpb25zX2VuZERhdGVfZGVzYyUyQ3Byb2RfYXVjdGlvbnNfY2xvc2VEYXRlX2FzYyUyQ3Byb2RfYXVjdGlvbnNfY2xvc2VEYXRlX2Rlc2MlMkNjcmVhdG9ycyUyQ3Byb2RfY3JlYXRvcnMlMkNwcm9kX2NyZWF0b3JzXyolMkNjcmVhdG9yc1YyJTJDcHJvZF9jcmVhdG9yc1YyJTJDcHJvZF9jcmVhdG9yc1YyXyolMkNpdGVtcyUyQ3Byb2RfaXRlbXMlMkNwcm9kX2l0ZW1zXyolMkNsb3RzJTJDcHJvZF9sb3RzJTJDcHJvZF9sb3RzXyolMkNwcm9kX2xvdHNfbG90TnJfYXNjJTJDcHJvZF9sb3RzX2xvdE5yX2Rlc2MlMkNwcm9kX2xvdHNfYXVjdGlvbkRhdGVfYXNjJTJDcHJvZF9sb3RzX2F1Y3Rpb25EYXRlX2Rlc2MlMkNwcm9kX3VwY29taW5nX2xvdHNfYXNjJTJDcHJvZF91cGNvbWluZ19sb3RzX2Rlc2MlMkNwcm9kX2xvdHNfbG93RXN0aW1hdGVfYXNjJTJDcHJvZF9sb3RzX2xvd0VzdGltYXRlX2Rlc2MlMkNwcm9kX3N1Z2dlc3RlZF9sb3RzJTJDcHJvZF9meWVvX2xvdHNfYXVjdGlvbkRhdGVfYXNjJTJDcHJvZF9meWVvX2xvdHNfYXVjdGlvbkRhdGVfZGVzYyUyQ29iamVjdF90eXBlcyUyQ3Byb2Rfb2JqZWN0X3R5cGVzJTJDcHJvZF9vYmplY3RfdHlwZXNfKiUyQ2F0dHJpYnV0ZXMlMkNwcm9kX2F0dHJpYnV0ZXMlMkNwcm9kX2F0dHJpYnV0ZXNfKiUyQ3BpZWNlcyUyQ3Byb2RfcGllY2VzJTJDcHJvZF9waWVjZXNfKiUyQ3Byb2R1Y3RfaXRlbXMlMkNwcm9kX3Byb2R1Y3RfaXRlbXMlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfKiUyQ3Byb2RfcHJvZHVjdF9pdGVtc19sb3dFc3RpbWF0ZV9hc2MlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfbG93RXN0aW1hdGVfZGVzYyUyQ3Byb2RfcHJvZHVjdF9pdGVtc19wdWJsaXNoRGF0ZV9hc2MlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfcHVibGlzaERhdGVfZGVzYyUyQ3NvdGhlYnlzX2NhdGVnb3JpZXMlMkNzb3RoZWJ5c19jYXRlZ29yaWVzJTJDc290aGVieXNfY2F0ZWdvcmllc18qJTJDdGFnZ2luZ190YWdzZXRzJTJDcHJvZF90YWdnaW5nX3RhZ3NldHMlMkNwcm9kX3RhZ2dpbmdfdGFnc2V0c18qJTJDdGFnZ2luZ190YWdzJTJDcHJvZF90YWdnaW5nX3RhZ3MlMkNwcm9kX3RhZ2dpbmdfdGFnc18qJTJDb25ib2FyZGluZ190b3BpY3MlMkNwcm9kX29uYm9hcmRpbmdfdG9waWNzJTJDcHJvZF9vbmJvYXJkaW5nX3RvcGljc18qJTJDZm9sbG93YWJsZV90b3BpY3MlMkNwcm9kX2ZvbGxvd2FibGVfdG9waWNzJTJDcHJvZF9mb2xsb3dhYmxlX3RvcGljc18qJTJDd2luZSUyQ3Byb2Rfd2luZSUyQ3Byb2Rfd2luZV8qJmZpbHRlcnM9Tk9UK3N0YXRlJTNBQ3JlYXRlZCtBTkQrTk9UK3N0YXRlJTNBRHJhZnQrQU5EK05PVCtpc1Rlc3RSZWNvcmQlM0QxK0FORCslMjhOT1QrbG9jYXRpb24lM0ElMjJTaGFuZ2hhaStBdWN0aW9uJTIyJTI5K0FORCtOT1QrbG90U3RhdGUlM0FDcmVhdGVkK0FORCtOT1QrbG90U3RhdGUlM0FEcmFmdCtBTkQrJTI4Tk9UK2lzSGlkZGVuJTNBdHJ1ZStPUitsZWFkZXJJZCUzQTAwMDAwMDAwLTAwMDAtMDAwMC0wMDAwLTAwMDAwMDAwMDAwMCUyOQ==',
        'Origin': 'https://www.sothebys.com/',
        'content-type': 'application/x-www-form-urlencoded',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
    }
    api_url = 'https://kar1ueupjd-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia for JavaScript (4.14.3); Browser (lite); JS Helper (3.11.3); react (18.2.0); react-instantsearch (6.39.0)&x-algolia-api-key=NDYyNjhkYWVlNTQxOTAwNmMyYTBjODZiZjBiZmJlYzY4OGNlYzBmZjJjZjM5M2MxOTRjNmMyNjljNjAyYjY5MXZhbGlkVW50aWw9MTY4MTE2NDg1NyZyZXN0cmljdEluZGljZXM9YXR0cmlidXRlX2ZpeGVkX3ZhbHVlcyUyQ3Byb2RfYXR0cmlidXRlX2ZpeGVkX3ZhbHVlcyUyQ3Byb2RfYXR0cmlidXRlX2ZpeGVkX3ZhbHVlc18qJTJDYXVjdGlvbnMlMkNwcm9kX2F1Y3Rpb25zJTJDcHJvZF9hdWN0aW9uc18qJTJDcHJvZF9hdWN0aW9uc19uYW1lX2FzYyUyQ3Byb2RfYXVjdGlvbnNfbmFtZV9kZXNjJTJDcHJvZF9hdWN0aW9uc19zdGFydERhdGVfYXNjJTJDcHJvZF9hdWN0aW9uc19zdGFydERhdGVfZGVzYyUyQ3Byb2RfYXVjdGlvbnNfZW5kRGF0ZV9hc2MlMkNwcm9kX2F1Y3Rpb25zX2VuZERhdGVfZGVzYyUyQ3Byb2RfYXVjdGlvbnNfY2xvc2VEYXRlX2FzYyUyQ3Byb2RfYXVjdGlvbnNfY2xvc2VEYXRlX2Rlc2MlMkNjcmVhdG9ycyUyQ3Byb2RfY3JlYXRvcnMlMkNwcm9kX2NyZWF0b3JzXyolMkNjcmVhdG9yc1YyJTJDcHJvZF9jcmVhdG9yc1YyJTJDcHJvZF9jcmVhdG9yc1YyXyolMkNpdGVtcyUyQ3Byb2RfaXRlbXMlMkNwcm9kX2l0ZW1zXyolMkNsb3RzJTJDcHJvZF9sb3RzJTJDcHJvZF9sb3RzXyolMkNwcm9kX2xvdHNfbG90TnJfYXNjJTJDcHJvZF9sb3RzX2xvdE5yX2Rlc2MlMkNwcm9kX2xvdHNfYXVjdGlvbkRhdGVfYXNjJTJDcHJvZF9sb3RzX2F1Y3Rpb25EYXRlX2Rlc2MlMkNwcm9kX3VwY29taW5nX2xvdHNfYXNjJTJDcHJvZF91cGNvbWluZ19sb3RzX2Rlc2MlMkNwcm9kX2xvdHNfbG93RXN0aW1hdGVfYXNjJTJDcHJvZF9sb3RzX2xvd0VzdGltYXRlX2Rlc2MlMkNwcm9kX3N1Z2dlc3RlZF9sb3RzJTJDcHJvZF9meWVvX2xvdHNfYXVjdGlvbkRhdGVfYXNjJTJDcHJvZF9meWVvX2xvdHNfYXVjdGlvbkRhdGVfZGVzYyUyQ29iamVjdF90eXBlcyUyQ3Byb2Rfb2JqZWN0X3R5cGVzJTJDcHJvZF9vYmplY3RfdHlwZXNfKiUyQ2F0dHJpYnV0ZXMlMkNwcm9kX2F0dHJpYnV0ZXMlMkNwcm9kX2F0dHJpYnV0ZXNfKiUyQ3BpZWNlcyUyQ3Byb2RfcGllY2VzJTJDcHJvZF9waWVjZXNfKiUyQ3Byb2R1Y3RfaXRlbXMlMkNwcm9kX3Byb2R1Y3RfaXRlbXMlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfKiUyQ3Byb2RfcHJvZHVjdF9pdGVtc19sb3dFc3RpbWF0ZV9hc2MlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfbG93RXN0aW1hdGVfZGVzYyUyQ3Byb2RfcHJvZHVjdF9pdGVtc19wdWJsaXNoRGF0ZV9hc2MlMkNwcm9kX3Byb2R1Y3RfaXRlbXNfcHVibGlzaERhdGVfZGVzYyUyQ3NvdGhlYnlzX2NhdGVnb3JpZXMlMkNzb3RoZWJ5c19jYXRlZ29yaWVzJTJDc290aGVieXNfY2F0ZWdvcmllc18qJTJDdGFnZ2luZ190YWdzZXRzJTJDcHJvZF90YWdnaW5nX3RhZ3NldHMlMkNwcm9kX3RhZ2dpbmdfdGFnc2V0c18qJTJDdGFnZ2luZ190YWdzJTJDcHJvZF90YWdnaW5nX3RhZ3MlMkNwcm9kX3RhZ2dpbmdfdGFnc18qJTJDb25ib2FyZGluZ190b3BpY3MlMkNwcm9kX29uYm9hcmRpbmdfdG9waWNzJTJDcHJvZF9vbmJvYXJkaW5nX3RvcGljc18qJTJDZm9sbG93YWJsZV90b3BpY3MlMkNwcm9kX2ZvbGxvd2FibGVfdG9waWNzJTJDcHJvZF9mb2xsb3dhYmxlX3RvcGljc18qJTJDd2luZSUyQ3Byb2Rfd2luZSUyQ3Byb2Rfd2luZV8qJmZpbHRlcnM9Tk9UK3N0YXRlJTNBQ3JlYXRlZCtBTkQrTk9UK3N0YXRlJTNBRHJhZnQrQU5EK05PVCtpc1Rlc3RSZWNvcmQlM0QxK0FORCslMjhOT1QrbG9jYXRpb24lM0ElMjJTaGFuZ2hhaStBdWN0aW9uJTIyJTI5K0FORCtOT1QrbG90U3RhdGUlM0FDcmVhdGVkK0FORCtOT1QrbG90U3RhdGUlM0FEcmFmdCtBTkQrJTI4Tk9UK2lzSGlkZGVuJTNBdHJ1ZStPUitsZWFkZXJJZCUzQTAwMDAwMDAwLTAwMDAtMDAwMC0wMDAwLTAwMDAwMDAwMDAwMCUyOQ==&x-algolia-application-id=KAR1UEUPJD'
    payload = '{"requests":[{"indexName":"prod_product_items","params":"clickAnalytics=true&facets=%5B%22department%22%2C%22categories.lvl0%22%2C%22categories.lvl2%22%2C%22categories.lvl3%22%2C%22creators%22%2C%22Watch%20Model%22%2C%22Complication%22%2C%22lowEstimate%22%2C%22highEstimate%22%2C%22Gender%22%2C%22Period%20-%20General%22%2C%22Year%22%2C%22Movement%20Type%22%2C%22Case%20Size%20(mm)%22%2C%22Case%20Material(s)%22%2C%22Bezel%20Material(s)%22%2C%22Dial%20Color(s)%22%2C%22Ships%20from%20-%20Country%22%2C%22International%20Shipping%22%5D&filters=waysToBuy%3AbuyNow%20AND%20categories.lvl1%3A'Luxury%20%3E%20Watches'%20AND%20objectTypes%3AWatch%20AND%20'Certified%20Pre-Owned%20By%20Bucherer'%3Atrue&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&highlightPreTag=%3Cais-highlight-0000000000%3E&maxValuesPerFacet=1000&page=0&query=&ruleContexts=%5B%22en_luxury_watches_watch_bucherer-certified-pre-owned%22%5D&tagFilters="}]}'
    yield scrapy.Request(
        url=api_url,
        headers=headers,
        body=payload,
        callback=self.parse,
        method="POST")

def parse(self, response):
    data = response.json()['results'][0]['hits']
    for x in data:
        yield x
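
One plausible cause of the SyntaxError, judging only from the snippet as posted: the payload is a single-quoted Python string, but the filter inside it contains single quotes of its own (around Luxury%20%3E%20Watches and Certified%20Pre-Owned%20By%20Bucherer), so the literal ends at the first inner quote. The sketch below illustrates this with a shortened stand-in for the `params` value; `build_body` and `is_valid_python` are hypothetical helpers, not part of Scrapy or Algolia.

```python
import json

# Shortened stand-in for the `params` value of the payload above. The single
# quotes belong to the filter, so the surrounding Python literal uses
# double quotes instead.
PARAMS = "filters=categories.lvl1%3A'Luxury%20%3E%20Watches'&page=0&query="


def build_body(params):
    """Let json.dumps handle the JSON quoting instead of hand-writing it."""
    return json.dumps(
        {"requests": [{"indexName": "prod_product_items", "params": params}]}
    )


def is_valid_python(source):
    """Return True if `source` compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True


# The same kind of content wrapped in single quotes: the literal stops at the
# first inner quote and the rest no longer parses.
BROKEN = "payload = '{\"params\":\"lvl1%3A'Luxury'\"}'"
```

With the full `params` string substituted for the stand-in, `body=build_body(PARAMS)` could replace the hand-written `payload` literal.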
