I am trying to get the dates from the soup of a scraped Google News page:
date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
print(date_section)
Output:
[<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>]
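I want to get a list of all the dates from this structure.
This is how I currently access a single date, and I can build the list of dates with a loop. I would like to know whether there is a more elegant way to access the dates in a structure like this with BeautifulSoup.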
print("First date",date_section[0].text)
Try:
from bs4 import BeautifulSoup
html ='''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>,
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''
soup= BeautifulSoup(html, 'lxml')
date_section = soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})
for d in date_section:
    print(d.text)
Output:
30 May 2020
3 weeks ago
1 week ago
2 weeks ago
22 Nov 2020
19 Mar 2019
18 Mar 2019
11 Aug 2019
1 Aug 2019
4 Jun 2009
"I want to get a list of all the dates from this structure."
To get a list, just iterate over the ResultSet, for example with a list comprehension:
[e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]
Or with CSS selectors:
[e.get_text(strip=True) for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]
Both will result in:
['30 May 2020', '3 weeks ago', '1 week ago', '2 weeks ago', '22 Nov 2020', '19 Mar 2019', '18 Mar 2019', '11 Aug 2019', '1 Aug 2019', '4 Jun 2009']
from bs4 import BeautifulSoup
html ='''
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>30 May 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>3 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 week ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>2 weeks ago</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>22 Nov 2020</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>19 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>18 Mar 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>11 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>1 Aug 2019</span></div>
<div class="OSrXXb ZE0LJd YsWzw" style="bottom:0px"><span>4 Jun 2009</span></div>
'''
# Name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(html, 'html.parser')
[e.get_text(strip=True) for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]
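One difference between the two approaches may matter on a real page: passing the full class string to find_all matches the class attribute as an exact string, while a CSS selector matches each class independently of order. The sketch below illustrates this with a made-up two-element snippet in which the second div lists the same classes in a different order; the variable names `exact` and `by_css` are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second div has the same classes in a different order
html = '''
<div class="OSrXXb ZE0LJd YsWzw"><span>30 May 2020</span></div>
<div class="YsWzw OSrXXb ZE0LJd"><span>3 weeks ago</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Full class string: only matches elements whose class attribute is exactly this string
exact = [e.get_text(strip=True)
         for e in soup.find_all('div', {"class": "OSrXXb ZE0LJd YsWzw"})]

# CSS selector: matches any div carrying all three classes, in any order
by_css = [e.get_text(strip=True)
          for e in soup.select('div.OSrXXb.ZE0LJd.YsWzw span')]

print(exact)   # ['30 May 2020']
print(by_css)  # ['30 May 2020', '3 weeks ago']
```

If the class order on the scraped page is stable, both forms behave the same; the selector form is simply more tolerant if it is not.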