Enter a code, click a button, and extract the result with Scrapy



I should say up front that I have never used Scrapy (so I don't even know whether it is the right tool).

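On the website https://www.ufficiocamerale.it/, I would like to enter an 11-digit code (for example 06655971007) in the bar labeled "INSERISCI LA PARTITA IVA/RAGIONE SOCIALE" and then click "CERCA". I then want to save the resulting HTML in a variable, which I will later analyze with BeautifulSoup (I should have no problems with that part). So how do I do the first part?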

What I have in mind is something like this:

import scrapy


class Extraction(scrapy.Spider):
    def start_requests(self):
        url = "https://www.ufficiocamerale.it/"
        # To enter data
        yield scrapy.FormRequest(url=url, formdata={...}, callback=self.parse)
        # To click the button
        # some code

    def parse(self, response):
        print(response.body)

This is the HTML of the search bar and the button:

<input type="search" name="search_input" class="autocomplete form-control" onchange="if (!window.__cfRLUnblockHandlers) return false; checkPartitaIva()" onkeyup="if (!window.__cfRLUnblockHandlers) return false; checkPartitaIva()" id="search_input" placeholder=" " value="">
<button onclick="if (!window.__cfRLUnblockHandlers) return false; dataLayer.push({'event': 'trova azienda'});" type="submit" class="btn btn-primary btn-sm text-uppercase">Cerca</button>
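
If the form could be submitted as a plain HTTP request, the key passed in `formdata` would presumably be the `name` attribute of the input. The following is a minimal standard-library sketch that reads that attribute from the HTML above. `InputNameCollector` is a hypothetical helper, and whether the site accepts such a request without running its JavaScript is an assumption that is not confirmed here.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Hypothetical helper: collects the name attribute of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


# The search field from the page, with the event handlers omitted for brevity
snippet = (
    '<input type="search" name="search_input" '
    'class="autocomplete form-control" id="search_input" '
    'placeholder=" " value="">'
)

collector = InputNameCollector()
collector.feed(snippet)
print(collector.names)  # ['search_input']
```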

The page uses JavaScript to generate some of its elements, so it would be simpler to use Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = 'https://www.ufficiocamerale.it/'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # JavaScript needs time to load code

# type the code into the search field
item = driver.find_element(By.XPATH, '//form[@id="formRicercaAzienda"]//input[@id="search_input"]')
#item = driver.find_element(By.ID, 'search_input')
item.send_keys('06655971007')
time.sleep(1)

# click the "Cerca" button
button = driver.find_element(By.XPATH, '//form[@id="formRicercaAzienda"]//p//button[@type="submit"]')
button.click()
time.sleep(5)  # JavaScript needs time to load code

# heading of the results page
item = driver.find_element(By.TAG_NAME, 'h1')
print(item.text)
print('---')

# company details, one <li> per field
all_items = driver.find_elements(By.XPATH, '//ul[@id="first-group"]/li')
for item in all_items:
    if '@' in item.text:
        print(item.text, '<<< found email:', item.text.split(' ')[1])
    else:
        print(item.text)
print('---')

Result:

DATI DELLA SOCIETÀ - ENEL ENERGIA S.P.A.
---
Partita IVA: 06655971007 - Codice Fiscale: 06655971007
Rag. Sociale: ENEL ENERGIA S.P.A.
Indirizzo: VIALE REGINA MARGHERITA 125 - 00198 - ROMA
Rea: 1150724
PEC: enelenergia@pec.enel.it <<< found email: enelenergia@pec.enel.it
Fatturato: € 13.032.695.000,00 (2020)
ACQUISTA BILANCIO
Dipendenti : 1666 (2021)
---
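
Most of the lines in this result follow a `label: value` pattern, so they can be collected into a dictionary for later use. The following is only a sketch: `parse_company_lines` is a hypothetical helper, and the sample list simply copies the text printed above rather than reading it from the browser.

```python
def parse_company_lines(lines):
    """Turn 'label: value' strings into a dict; lines without a colon are skipped."""
    data = {}
    for line in lines:
        if ':' not in line:
            continue  # e.g. the "ACQUISTA BILANCIO" link text
        # split on the first colon only, so values may contain colons themselves
        label, value = line.split(':', 1)
        data[label.strip()] = value.strip()
    return data


# Sample input copied from the result shown above
sample = [
    "Partita IVA: 06655971007 - Codice Fiscale: 06655971007",
    "Rag. Sociale: ENEL ENERGIA S.P.A.",
    "Indirizzo: VIALE REGINA MARGHERITA 125 - 00198 - ROMA",
    "Rea: 1150724",
    "PEC: enelenergia@pec.enel.it",
    "Fatturato: € 13.032.695.000,00 (2020)",
    "ACQUISTA BILANCIO",
    "Dipendenti : 1666 (2021)",
]

info = parse_company_lines(sample)
print(info["PEC"])
print(info["Rea"])
```

With Selenium, the list could be built from the `text` of each element in `all_items`. Note that the first line holds two fields, so its value keeps the "Codice Fiscale" part and would need a further split.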
