Trying to scrape the date of the last document on a webpage in Python



The date I want looks like this: 01/19/2021, and I want to get the "19" into a Python variable.

<span class="grayItalic">
Received: 01/19/2021
</span>

The following code does not work:

date = soup.find('span', {'class': 'grayItalic'}).get_text()
converted_date = int(date[13:14])
print(date)

I get this error: 'NoneType' object has no attribute 'get_text'. Can anyone help?
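As an aside on the slicing part of the question: once the text is in hand, a regex is more robust than fixed indexes, because the span's text carries leading whitespace and a label. A stdlib-only sketch, assuming the text looks like the HTML snippet above:

```python
import re

# Text as it would come back from get_text() on the span above,
# including the surrounding newlines.
text = "\nReceived: 01/19/2021\n"

# Match an MM/DD/YYYY date anywhere in the string.
match = re.search(r"(\d{2})/(\d{2})/(\d{4})", text)
day = int(match.group(2))  # the second group is the day
print(day)  # 19
```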

Attempt using headers:

import requests
from bs4 import BeautifulSoup
url = "https://iapps.courts.state.ny.us/nyscef/DocumentList?docketId=npvulMdOYzFDYIAomW_PLUS_elw==&display=all"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content,'html.parser')
date = soup.find('span', {'class': 'grayItalic'}).get_text().strip()
converted_date = int(date.split("/")[-2])
print(converted_date)
print(date)
import dateutil.parser
from bs4 import BeautifulSoup

html_doc = """<span class="grayItalic">
Received: 01/19/2021
</span>"""
soup = BeautifulSoup(html_doc, 'html.parser')
date_ = soup.find('span', {'class': 'grayItalic'}).get_text()
dateutil.parser.parse(date_, fuzzy=True)

Output:

datetime.datetime(2021, 1, 19, 0, 0)

date_ comes out as '\n Received: 01/19/2021\n'. You used string slicing, but you can use dateutil.parser instead; it returns a datetime.datetime object. In this case I assume you only need the date. If you also need the surrounding text, you can use fuzzy_with_tokens=True:

if the fuzzy_with_tokens option is True, returns a tuple, the first element being a datetime.datetime object, the second a tuple containing the fuzzy tokens.

dateutil.parser.parse(date_, fuzzy_with_tokens=True)

(datetime.datetime(2021, 1, 19, 0, 0), (' Received: ', ' '))
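Since the original goal was to get the day number ("19") as an integer, note that a datetime object exposes it directly via its .day attribute. A minimal stdlib sketch, assuming the same span text as above:

```python
from datetime import datetime

# The span text with the "Received: " label and whitespace stripped.
date_text = "Received: 01/19/2021".replace("Received: ", "").strip()

# Parse the MM/DD/YYYY string and read the day component.
parsed = datetime.strptime(date_text, "%m/%d/%Y")
print(parsed.day)  # 19
```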

I could not load the URL with the requests or urllib modules. I suspect the site blocks automated requests. So I opened the page in a browser, saved the source to a file named page.html, and ran the BeautifulSoup operations on that. This seems to work.

from bs4 import BeautifulSoup

html = open("page.html")
soup = BeautifulSoup(html, 'html.parser')
date_span = soup.find('span', {'class': 'grayItalic'})
if date_span is not None:
    print(str(date_span.text).strip().replace("Received: ", ""))
    # output: 04/25/2019

I tried to fetch the source with the requests library as below, but it did not work (the page probably blocks requests). See whether it runs on your machine.

url = "..."
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
response = requests.get(url, headers=headers)
html = response.content
print(html)
