I need to scrape PDFs from the website https://secc.gov.in/lgdStateList.
There are three dropdowns: State, District and Block. There are several states, each state has districts under it, and each district has blocks.
I tried the code below. I can select a state, but something seems to go wrong when I select a district.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

# First pass: dump the dropdown rows from the rendered page source
browser = webdriver.Chrome()
url = "https://secc.gov.in/lgdStateList"
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'html.parser')
for name_list in soup.find_all(class_='dropdown-row'):
    print(name_list.text)

# Second pass: try to walk the three cascading dropdowns
driver = webdriver.Chrome()
driver.get('https://secc.gov.in/lgdStateList')

selectState = Select(driver.find_element_by_id("lgdState"))
for state in selectState.options:
    state.click()
    selectDistrict = Select(driver.find_element_by_id("lgdDistrict"))
    for district in selectDistrict.options:
        district.click()
        selectBlock = Select(driver.find_element_by_id("lgdBlock"))
        for block in selectBlock.options:
            block.click()
The error I'm getting is:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="lgdDistrict"]"}
(Session info: chrome=83.0.4103.106)
I need help crawling all three menus.
Any help or suggestions would be much appreciated. Let me know in the comments if anything needs clarifying.
Here you can find the values of the different states. You can find the same in the district and block dropdowns.
Now, use those values in the payload to get the table you want to collect data from:
import urllib3
import requests
from bs4 import BeautifulSoup

# The site's certificate fails verification, so silence the warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"

# Example codes picked from the three dropdowns (state -> district -> block)
payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")
for items in soup.select("table#example tr"):
    data = [' '.join(item.text.split()) for item in items.select("th,td")]
    print(data)
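The state values themselves can be read out of the dropdown's `<option>` tags. A minimal sketch of that parsing (the HTML below is a made-up stand-in for the real page's markup, which may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the state dropdown on lgdStateList
html = """
<select id="lgdState">
  <option value="">Select State</option>
  <option value="10">BIHAR</option>
  <option value="28">ANDHRA PRADESH</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
# Map visible state names to their numeric codes, skipping the placeholder
codes = {o.text.strip(): o["value"]
         for o in soup.select("select#lgdState option") if o["value"]}
print(codes)  # {'BIHAR': '10', 'ANDHRA PRADESH': '28'}
```

The same selector pattern, swapping the `id`, works for the district and block dropdowns once they are populated.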
The script produces this output:
['Select State', 'Select District', 'Select Block']
['', 'Select District', 'Select Block']
['ARARIA BASTI (93638)', 'BANGAMA (93639)', 'BANSBARI (93640)']
['BASANTPUR (93641)', 'BATURBARI (93642)', 'BELWA (93643)']
['BOCHI (93644)', 'CHANDRADEI (93645)', 'CHATAR (93646)']
['CHIKANI (93647)', 'DIYARI (93648)', 'GAINRHA (93649)']
['GAIYARI (93650)', 'HARIA (93651)', 'HAYATPUR (93652)']
['JAMUA (93653)', 'JHAMTA (93654)', 'KAMALDAHA (93655)']
['KISMAT KHAWASPUR (93656)', 'KUSIYAR GAWON (93657)', 'MADANPUR EAST (93658)']
['MADANPUR WEST (93659)', 'PAIKTOLA (93660)', 'POKHARIA (93661)']
['RAMPUR KODARKATTI (93662)', 'RAMPUR MOHANPUR EAST (93663)', 'RAMPUR MOHANPUR WEST (93664)']
['SAHASMAL (93665)', 'SHARANPUR (93666)', 'TARAUNA BHOJPUR (93667)']
You need to grab the number in parentheses next to each result above, use it in the payload, and send another POST request to download the PDF file. Make sure to run the script from inside a dedicated folder so all the downloaded files end up there.
import urllib3
import requests
from bs4 import BeautifulSoup

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

link = "https://secc.gov.in/lgdGpList"
download_link = "https://secc.gov.in/downloadLgdwisePdfFile"

payload = {
    'stateCode': '10',
    'districtCode': '188',
    'blockCode': '1624'
}

# Fetch the GP table, then download one PDF per GP code found in it
r = requests.post(link, data=payload, verify=False)
soup = BeautifulSoup(r.text, "html.parser")
for item in soup.select("table#example td > a[onclick^='downloadLgdFile']"):
    # The GP code is the number in parentheses, e.g. "BANGAMA (93639)"
    gp_code = item.text.strip().split("(")[1].split(")")[0]
    payload['gpCode'] = gp_code
    with open(f'{gp_code}.pdf', 'wb') as f:
        f.write(requests.post(download_link, data=payload, verify=False).content)
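To cover all three menus rather than one hard-coded combination, collect the codes for each dropdown level and cross them into payloads. A sketch under the assumption that the codes have already been scraped (the nested dict here is placeholder data, not real codes beyond the example above):

```python
def build_payloads(tree):
    """Flatten a {state: {district: [blocks]}} mapping into POST payloads."""
    for state, districts in tree.items():
        for district, blocks in districts.items():
            for block in blocks:
                yield {"stateCode": state,
                       "districtCode": district,
                       "blockCode": block}

# Placeholder codes standing in for values collected from the dropdowns
tree = {"10": {"188": ["1624", "1625"]}}
payloads = list(build_payloads(tree))
print(len(payloads))  # prints 2
```

Each payload can then be fed through the GP-table request and download loop above.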