I have a web page that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I achieve this?
Thanks.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup = BeautifulSoup(t, "lxml")
# every <a> element is one card
aTags = soup.select('a')
data = []
for aTag in aTags:
    # the <time> element nested inside this card
    timeTag = aTag.select_one('time')
    data.append([aTag.get('href'), timeTag['datetime']])
print(data)
You can use the page source returned by Selenium (for example driver.page_source) in place of t.
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
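If installing bs4 and lxml is not an option, the same pairing can be sketched with the standard library's html.parser. This is a swapped-in technique, not the approach from the answer above. The class name CardParser and its attributes are hypothetical, and it assumes each card is an <a> whose class list contains "card" and that only the first <time> per card matters.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collect [href, datetime] pairs from <a class="card ..."> blocks."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the card we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "card" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
        elif tag == "time" and self._href is not None and "datetime" in attrs:
            self.pairs.append([self._href, attrs["datetime"]])
            self._href = None  # keep only the first <time> of each card

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None  # leaving the card


html = '''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper"><div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div></div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper"><div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div></div>
</a>'''

parser = CardParser()
parser.feed(html)
print(parser.pairs)
# [['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
```

Unlike the loop above, this version skips a card that has no <time> element instead of raising an error.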
In Python, you can use Selenium's find_elements_by_xpath() and get_attribute() functions, as follows:
# for the hrefs ("cardlisting" matches cardlisting0, cardlisting1, ...)
urls = [a.get_attribute('href') for a in driver.find_elements_by_xpath('//a[contains(@class, "cardlisting")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements_by_xpath('//a//time')]
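The two lists above are built independently, so they only line up if every card contains exactly one <time>. A minimal sketch that reads both attributes from the same card, so each pair is guaranteed to belong together, could look like this. The function name scrape_pairs is hypothetical and driver is assumed to be an already opened Selenium WebDriver. Because the find_elements_by_* helpers were removed in Selenium 4.3, it uses the generic find_elements(by, value) call instead.

```python
def scrape_pairs(driver):
    """Return [href, datetime] pairs, one per card that contains a <time>."""
    pairs = []
    # "xpath" is the string value of Selenium's By.XPATH constant
    for card in driver.find_elements("xpath", '//a[contains(@class, "cardlisting")]'):
        # search relative to this card only (note the leading dot)
        times = card.find_elements("xpath", ".//time")
        if times:  # skip cards without a <time> element
            pairs.append([card.get_attribute("href"),
                          times[0].get_attribute("datetime")])
    return pairs
```

Usage would be something like data = scrape_pairs(driver) after the page has loaded. Note that Selenium may return the href resolved to an absolute URL rather than the literal attribute text.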