I want my program to read a bookmarks.html file exported from Firefox.
from bs4 import BeautifulSoup
import time, re
f = open(r'D:/TestCode/bookmarks.html','r',encoding="utf8")
soup = BeautifulSoup(f.read(),"lxml")
f.close()
dl = []
for i in soup.findAll("dl"):
    dl.append(i)
for j in range(len(dl)):
    if dl[j].contents[0].has_attr('href') and dl[j].contents[0].has_attr('add_date'):
        uri = dl[j].contents[0]['href']
        print(uri)
Here is a sample of the bookmarks file exported from Firefox:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>bookmark menu</H1>
<DL><p>
<DT><H3 ADD_DATE="1517201918" LAST_MODIFIED="1550415410">Mozilla Firefox</H3>
<DL><p>
<DT><A HREF="https://support.mozilla.org/th/products/firefox" ADD_DATE="1545397135" LAST_MODIFIED="1545397135">help</A>
</DL><p>
<DT><H3 ADD_DATE="1395221079" LAST_MODIFIED="1550979714">Other</H3>
<DL><p>
<DT>...
</DL>
.
.
.
.
<DT><H3 ADD_DATE="1561105535" LAST_MODIFIED="1561113405">importMobile</H3>
<DL><p>
<DT><A HREF="need this" ADD_DATE="1549779806" LAST_MODIFIED="1561113405"></A>
<DT><A HREF="need this" ADD_DATE="1551437973" LAST_MODIFIED="1561113405"></A>
<DT><A HREF="need this" ADD_DATE="1552966401" LAST_MODIFIED="1561113405"></A>
</DL><p>
.
.
.
</DL>
I tried to get the hrefs under the importMobile folder, but it raises AttributeError: 'NavigableString' object has no attribute 'has_attr'.
Use a CSS selector to search for anchor tags that carry both attributes. This should give you the expected result.
from bs4 import BeautifulSoup
data='''<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>bookmark menu</H1>
<DL><p>
<DT><H3 ADD_DATE="1517201918" LAST_MODIFIED="1550415410">Mozilla Firefox</H3>
<DL><p>
<DT><A HREF="https://support.mozilla.org/th/products/firefox" ADD_DATE="1545397135" LAST_MODIFIED="1545397135">help</A>
</DL><p>
<DT><H3 ADD_DATE="1395221079" LAST_MODIFIED="1550979714">Other</H3>
<DL><p>
<DT>...
</DL>
.
.
.
.
<DT><H3 ADD_DATE="1561105535" LAST_MODIFIED="1561113405">importMobile</H3>
<DL><p>
<DT><A HREF="need this" ADD_DATE="1549779806" LAST_MODIFIED="1561113405"></A>
<DT><A HREF="need this" ADD_DATE="1551437973" LAST_MODIFIED="1561113405"></A>
<DT><A HREF="need this" ADD_DATE="1552966401" LAST_MODIFIED="1561113405"></A>
</DL><p>
.
.
.
</DL>'''
soup=BeautifulSoup(data,'lxml')
item=soup.find_all('dl')[3]
for tag in item.select('a[href][add_date]'):
    print(tag['href'])
Output:
need this
need this
need this
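The index [3] above depends on the folder order in this particular export. Below is a minimal sketch that looks the folder up by the text of its H3 heading instead, assuming the heading text matches the folder name exactly. The helper name `folder_links`, the shortened sample markup, and the example.com URLs are hypothetical, and `html.parser` is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened export in the same Netscape bookmark layout.
SAMPLE = '''<DL><p>
<DT><H3 ADD_DATE="1395221079">Other</H3>
<DL><p>
<DT><A HREF="https://example.com/other" ADD_DATE="1545397135">other</A>
</DL><p>
<DT><H3 ADD_DATE="1561105535">importMobile</H3>
<DL><p>
<DT><A HREF="https://example.com/m1" ADD_DATE="1549779806">m1</A>
<DT><A HREF="https://example.com/m2" ADD_DATE="1551437973">m2</A>
</DL><p>
</DL>'''


def folder_links(html, folder_name):
    """Return the href of every bookmark inside the named folder."""
    soup = BeautifulSoup(html, "html.parser")
    # The folder title is the text of an <h3>; tag names are lower-cased by bs4.
    heading = soup.find("h3", string=folder_name)
    if heading is None:
        return []
    # The folder's own <dl> is the first <dl> after the heading in document order.
    folder = heading.find_next("dl")
    if folder is None:
        return []
    # href=True keeps only anchors that actually have an href attribute.
    return [a["href"] for a in folder.find_all("a", href=True)]


print(folder_links(SAMPLE, "importMobile"))
```

Because `find_all` searches recursively, bookmarks in any subfolder of the named folder are returned as well.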
Check every dt that contains an a tag with an add_date attribute, like this:
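from bs4 import BeautifulSoup

# Change this to the path of your bookmarks file
with open(r'abc.html', encoding='utf8') as f:
    soup = BeautifulSoup(f.read(), "lxml")

for dt in soup.find_all("dt"):
    a = dt.find('a')
    # Skip folder entries (h3 only) and anchors without add_date;
    # .get() returns None instead of raising KeyError
    if a is not None and a.get('add_date') is not None:
        print(a['href'])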
Output:
https://support.mozilla.org/th/products/firefox
need this
need this
need this
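For reference, the AttributeError in the question comes from `.contents`, which holds text nodes (NavigableString) as well as tags, and only Tag objects have `has_attr`. A small sketch on a hypothetical fragment (the markup and the example.com URL are made up for illustration) that filters with `isinstance` before touching attributes:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment: the newline right after <dl> becomes a text node.
html = '<dl>\n<a href="https://example.com/x" add_date="1">x</a></dl>'
soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl")

# The first child is the newline text node, not a tag,
# so calling first.has_attr(...) would raise AttributeError.
first = dl.contents[0]
print(type(first).__name__)

# Keep only real tags before checking attributes.
hrefs = [
    child["href"]
    for child in dl.contents
    if isinstance(child, Tag)
    and child.has_attr("href")
    and child.has_attr("add_date")
]
print(hrefs)
```

Searching with `find_all` or `select`, as in the answers above, avoids the problem because those methods only ever return tags.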