Scraping a dynamic table with pagination while keeping the links intact



I'm a beginner Python programmer trying to scrape a dynamic table (datatable) that has pagination. There are "First" and "Previous" pagination buttons with indices "0" and "1" respectively, followed by the numbered buttons (see attached image), so I want to start with the button at index "2" and iterate through the pages until I've captured the whole table with all links intact.

<a href="#" aria-controls="datatable" data-dt-idx="2" tabindex="0">1</a>

I managed to scrape the first ten rows of the table, but I have no idea how to get the remaining pages. I assume I need to loop through those pagination buttons somehow(?). After reading countless tutorials and Stack Overflow questions and watching a few YouTube videos, I pieced together the following code. However, I end up with the HTML of the entire site rather than just my table, and I only retrieve the first 10 rows of the table on the first page.

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(r"C:\Users\MyName\chromedriver", options=chrome_options)

url = "https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/warning-letters"
driver.get(url)

table_confirm = WebDriverWait(driver, 20).until(
    ec.presence_of_element_located((By.ID, "datatable"))
)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
print(soup)

data = []
table = soup.find('table', {'class': 'lcds-datatable table table-bordered cols-8 responsive-enabled dataTable no-footer dtr-inline collapsed'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

Can anyone help me out? Thank you!

[1]: https://i.stack.imgur.com/RUsui.png

If you view the page in your browser, and use your browser's developer tools to log the network traffic while you navigate through the pages, you'll see that every time you change pages an XHR (XmlHttpRequest) HTTP GET request is made to a REST API, whose response is JSON and contains all the information you're trying to scrape. That JSON is then parsed as usual and used to populate the DOM asynchronously with JavaScript.

To get the data you want, all you have to do is imitate that request. Selenium is overkill for this; all you need is requests. You can even tweak the request a bit to suit your needs. For example, by default, the request made by the page will only fetch the next 10 results/entries. I've changed it to grab 100 at a time, but there's really no advantage or disadvantage either way.
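Before writing the full loop, it can help to fire a single request by hand and look at the shape of the JSON. Here is a minimal sketch using the same endpoint, parameters, and headers as the full script below; the "recordsFiltered" and "data" keys are the ones the script below relies on, so verify them against the actual response you get.

import requests

# One-off probe of the endpoint used in the full script below.
url = "https://www.fda.gov/datatables/views/ajax"
params = {
    "length": 10,    # how many entries to return
    "start": 0,      # offset of the first entry
    "view_display_id": "warning_letter_solr_block",
    "view_name": "warning_letter_solr_index",
}
headers = {
    "user-agent": "Mozilla/5.0",
    "x-requested-with": "XMLHttpRequest"
}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()
data = response.json()

print(data["recordsFiltered"])   # total number of entries available
print(len(data["data"]))         # entries returned by this one request
print(data["data"][0])           # one raw entry: a list of strings, some of them HTML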

def make_pretty(entry):
    import re
    pattern = ">([^<]*)<"
    return {
        "posted_date": re.search(pattern, entry[0]).group(1),
        "letter_issue_date": re.search(pattern, entry[1]).group(1),
        "company_name": re.search(pattern, entry[2]).group(1),
        "issuing_office": entry[3],
        "subject": entry[4],
        "response_letter": entry[5],
        "closeout_letter": entry[6]
    }

def get_entries():
    import requests
    from itertools import count

    url = "https://www.fda.gov/datatables/views/ajax"
    group_length = 100
    params = {
        "length": group_length,
        "view_display_id": "warning_letter_solr_block",
        "view_name": "warning_letter_solr_index",
    }
    headers = {
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    for current_group in count(0):
        start = current_group * group_length
        end = ((current_group + 1) * group_length) - 1
        params["start"] = start

        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()

        data = response.json()
        if not data["data"]:
            break
        yield from map(make_pretty, data["data"])
        print("yielding {}-{}".format(start, min(end, data["recordsFiltered"])))

def main():
    global all_entries
    all_entries = list(get_entries())
    print("Total number of entries: {}".format(len(all_entries)))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

yielding 0-99
yielding 100-199
yielding 200-299
yielding 300-399
yielding 400-499
yielding 500-599
yielding 600-699
yielding 700-799
yielding 800-899
yielding 900-999
yielding 1000-1099
yielding 1100-1199
yielding 1200-1299
yielding 1300-1399
yielding 1400-1499
yielding 1500-1599
yielding 1600-1699
yielding 1700-1799
yielding 1800-1899
yielding 1900-1999
yielding 2000-2099
yielding 2100-2199
yielding 2200-2299
yielding 2300-2399
yielding 2400-2499
yielding 2500-2599
yielding 2600-2658
Total number of entries: 2658
>>> all_entries[0]
{'posted_date': '11/10/2021', 'letter_issue_date': '11/10/2021', 'company_name': 'Wyoming Vapor Company', 'issuing_office': 'Center for Tobacco Products', 'subject': 'Family Smoking Prevention and Tobacco Control Act/Adulterated/Misbranded', 'response_letter': '', 'closeout_letter': ''}

get_entries is a generator, which makes requests to the REST API and yields individual entries until there are no more entries left.
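Because it is a lazy generator, you don't have to consume every page. As a sketch (assuming get_entries from the script above is in scope), the standard-library itertools.islice lets you stop after a fixed number of entries, and the generator will stop issuing requests as soon as you stop pulling from it:

from itertools import islice

# Take only the first 25 entries; only the first API request is ever made.
first_25 = list(islice(get_entries(), 25))
print(len(first_25))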

make_pretty exists to make the individual entries that we yield "pretty". The individual "entries" in the JSON we receive correspond to lists of strings, some of which are HTML. make_pretty just naively parses the HTML string in each field and returns, for each entry, a dictionary of key-value pairs that is much cleaner to work with.
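To make that concrete, here is a tiny sketch of what the regex in make_pretty does. The sample_cell markup is hypothetical (the real API may wrap the date in different tags); only the "text between > and <" extraction is taken from the code above:

import re

pattern = ">([^<]*)<"
sample_cell = '<time datetime="2021-11-10">11/10/2021</time>'  # hypothetical markup
print(re.search(pattern, sample_cell).group(1))  # -> 11/10/2021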

main is the main entry point of the script. We call get_entries and consume every item from the generator, letting the entries accumulate in the all_entries list. I only added the global all_entries line so that I could play around with all_entries in a Python shell after the script finished and inspect it; it isn't required.
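Since you already import pandas in your own script, one natural follow-up (a sketch, not part of the answer's code) is to turn the accumulated list of dictionaries into a DataFrame:

import pandas as pd

# all_entries is the list of dicts produced by get_entries above.
df = pd.DataFrame(all_entries)
print(df.shape)    # expected to be (2658, 7) for the run shown above
print(df.head())
df.to_csv("warning_letters.csv", index=False)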

Take a look at another answer I posted to a similar question, where I go into more depth on using your browser's developer tools, logging network traffic, finding and imitating XHR requests, and how to inspect the response.


EDIT: Here is the updated code:

keys = (
    "posted_date",         # entry[0]
    "letter_issue_date",   # entry[1]
    "company_name",        # entry[2]
    "company_url",         # entry[2]
    "issuing_office",      # entry[3]
    "subject",             # entry[4]
    "response_letter_url", # entry[5]
    "closeout_letter_url"  # entry[6]
)

def make_pretty(entry):
    from bs4 import BeautifulSoup as Soup
    import re
    pattern = "[^<]*"

    return dict(zip(keys, [
        Soup(entry[0], "html.parser").text.strip(),
        Soup(entry[1], "html.parser").text.strip(),
        Soup(entry[2], "html.parser").text.strip(),
        entry[2] and "https://www.fda.gov" + Soup(entry[2], "html.parser").find("a")["href"],
        entry[3].strip(),
        re.search(pattern, entry[4]).group(),
        entry[5] and "https://www.fda.gov" + Soup(entry[5], "html.parser").find("a")["href"],
        entry[6] and "https://www.fda.gov" + Soup(entry[6], "html.parser").find("a")["href"]
    ]))

def get_entries():
    import requests
    from itertools import count

    url = "https://www.fda.gov/datatables/views/ajax"
    group_length = 100
    params = {
        "length": group_length,
        "view_display_id": "warning_letter_solr_block",
        "view_name": "warning_letter_solr_index",
    }
    headers = {
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    for current_group in count(0):
        start = current_group * group_length
        end = ((current_group + 1) * group_length) - 1
        params["start"] = start

        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()

        data = response.json()
        if not data["data"]:
            break
        yield from map(make_pretty, data["data"])
        print("yielding {}-{}".format(start, min(end, data["recordsFiltered"])))

def main():
    import csv

    with open("output.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=keys, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(get_entries())

    print("Done writing.")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
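Once the script has finished, output.csv can be read back with any CSV reader. As a quick sketch of how the columns come out (assuming the file was written by the main above):

import csv

# Quick check of the file written by main() above.
with open("output.csv", newline="", encoding="utf-8") as file:
    rows = list(csv.DictReader(file))

print(len(rows))                  # should match the total number of entries
print(rows[0]["company_name"])    # fields are named after the `keys` tuple
print(rows[0]["company_url"])     # absolute link rebuilt by make_pretty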
