Return table information based on a condition in Beautiful Soup/Python



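I am trying to scrape this page: https://www.nysenate.gov/legislation/bills/2019/s8450

I only want to extract information from the table (the one shown when you click "View Actions") if it contains the following string: "Delivered To Governor"

I can iterate over the table, but I am having trouble stripping out all the extra markup text.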

import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
bill_life_cycle_table

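You can use an if condition to check whether the string is present in a cell and then look up the previous cell's value. Use the CSS selector methods select() and select_one().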

from bs4 import BeautifulSoup
import requests

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
# Select the body of the bill actions table by its CSS classes
tablebody = soup.select_one(".table.c-bill--actions-table > tbody")
for item in tablebody.select("td"):
    # The page renders the action text in lowercase
    if "delivered to governor" in item.text:
        # The date sits in the cell immediately before the action cell
        print(item.find_previous("td").text)

Console output:

Dec 11, 2020
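
As a rough sketch of the same idea, the check can also be done row by row, which keeps each date paired with its own action and makes the match case-insensitive. The SAMPLE_HTML string and the find_action_dates helper below are hypothetical and only imitate the shape of the real table, so the example runs without a network request; the live markup may differ.

from bs4 import BeautifulSoup

# Hypothetical sample that imitates the shape of the bill actions table.
SAMPLE_HTML = """
<table class="table c-bill--actions-table">
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 23, 2020</td><td>substituted by a10500c</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""

def find_action_dates(html, phrase):
    """Return the dates of all rows whose action text contains phrase."""
    soup = BeautifulSoup(html, "html.parser")
    dates = []
    for row in soup.select("table.c-bill--actions-table > tbody > tr"):
        # get_text(" ", strip=True) drops the markup and trims whitespace
        cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
        # Assumes the first cell is the date and the second is the action
        if len(cells) >= 2 and phrase.lower() in cells[1].lower():
            dates.append(cells[0])
    return dates

print(find_action_dates(SAMPLE_HTML, "Delivered To Governor"))

Because the comparison lowercases both sides, the capitalized string from the question matches the lowercase text in the table, and an empty list comes back when no row matches.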

Use the bs4.element.Tag.text property:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
print(bill_life_cycle_table.text)

Output:


Dec 11, 2020
delivered to governor
Jul 23, 2020
returned to assemblypassed senate3rd reading cal.908substituted for s8450c
Jul 23, 2020
substituted by a10500c
Jul 22, 2020
ordered to third reading cal.908
Jul 20, 2020
reported and committed to rules
Jul 18, 2020
print number 8450c
Jul 18, 2020
amend and recommit to health
Jul 09, 2020
print number 8450b
Jul 09, 2020
amend and recommit to health
Jun 05, 2020
print number 8450a
Jun 05, 2020
amend and recommit to health
Jun 03, 2020
referred to health 

Update:

To print the date only when the condition is met:

from bs4 import BeautifulSoup
import requests

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
# Split the table text into lines: each date line is followed by its action line
bill_life_cycle_table = soup.find("tbody").text.splitlines()
# Pair every line with the line that follows it
for a, b in zip(bill_life_cycle_table, bill_life_cycle_table[1:]):
    if b.title() == "Delivered To Governor":
        print(a)

Output:

Dec 11, 2020

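You can read the <table> tag with pandas (read_html can use BeautifulSoup under the hood, depending on which parsers are installed). Then filter by column and return the date.

Code: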

import pandas as pd
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
df = pd.read_html(url)[0]
date = df[df.iloc[:,-1] == 'delivered to governor'].iloc[0,0]

Output:

print (date)
Dec 11, 2020
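
A possible variation on the pandas approach, shown only as a sketch: matching with str.contains avoids depending on exact casing, and returning a list avoids the IndexError that .iloc[0,0] raises when no row matches. SAMPLE_HTML and delivered_dates are hypothetical names, the sample only imitates the real table, and the example assumes a parser that read_html supports (lxml, or bs4 with html5lib) is installed.

from io import StringIO

import pandas as pd

# Hypothetical sample that imitates the shape of the bill actions table.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 23, 2020</td><td>substituted by a10500c</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""

def delivered_dates(html, phrase="delivered to governor"):
    """Return the first-column values of rows whose last column contains phrase."""
    # read_html returns one DataFrame per <table>; take the first one
    df = pd.read_html(StringIO(html))[0]
    actions = df.iloc[:, -1].astype(str)
    # Plain substring match, ignoring case
    mask = actions.str.contains(phrase, case=False, regex=False)
    return df.loc[mask].iloc[:, 0].tolist()

print(delivered_dates(SAMPLE_HTML))

Wrapping the markup in StringIO keeps the example offline; passing the page URL, as in the answer above, would work the same way.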
